\documentclass[man,12pt]{apa}
\usepackage{apa}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{multirow}
\usepackage{url}
\usepackage[latin1]{inputenc} 
\usepackage{caption}
\usepackage{subcaption}
%\usepackage[nogin]{Sweave}

\author{Alexander L. Davis*}
\affiliation{Department of Social and Decision Sciences\\
Carnegie Mellon University}
\note{\begin{flushleft} 
*E-mail address: alexander.l.davis1@gmail.com, Department of Social and Decision Sciences, Carnegie Mellon University, Pittsburgh PA 15213, USA, 412-216-2040. \end{flushleft}}

\acknowledgements{All materials and data can be obtained from the first author's Dataverse at \url{http://hdl.handle.net/1902.1/18699}. The open lab notebook can be obtained at \url{http://openwetware.org/wiki/User:Alexander_L._Davis}.  We thank the NSF for the dissertation enhancement grant, and John Sperger and Terence Einhorn for help collecting the data.}

\title{Incentives, Error, and Data Sharing}
\shorttitle{Incentives, Error, and Data Sharing}

\abstract{This research examines when individuals solving Wason's 2-4-6 rule discovery task attribute disconfirming feedback to error, and whether they decide to share perceived errors with another person trying to solve the same problem.  Participants invoked error for disconfirming feedback more than for affirming feedback.  Fixed-response data sharing decisions, made during the task (Experiment Two) and at the end of the task (Experiments Three and Four), showed that participants were less likely to share trials when feedback was disconfirming or when trials were attributed to error, even after controlling for actual error.  Experiments Two through Four found that incentives had no effects on rule discovery or attributions to error.  Although the perverse incentive used in Experiments Three and Four gave participants a financial motive to suppress data, they did the opposite, deciding to share unconvincing data that could cost them money.  Overall, decisions to share data were reasonable, although they depended too much on whether the data were perceived to be in error, and were not corrupted by incentives to hide disconfirming data.}

%These attributions to error were inaccurate when there was no financial penalty for inaccuracy (Experiments One and Two), but accurate when there was a financial penalty (Experiments Three and Four).  

\begin{document}


%\section{Submission Questions}
%\emph{1. What will the reader of this article learn about psychology that she or he did not (or could not) have known before?}

%The \emph{positive test strategy} is a hypothesis-testing strategy that seeks affirmation and discounts disconfirmation.  The article demonstrates that this strategy extends to communication of data, such that data are less likely to be shared with another participant who can use them if they are disconfirming or perceived to be in error.  Furthermore, incentives for discovery or for sharing data have no effect on this positive test strategy or data sharing, but penalties for incorrect judgments improve the accuracy and consistency of error attributions.  Finally, decisions not to share data are determined more by perception of error than by actual error, as perceptions are inappropriately influenced by whether feedback was disconfirming or affirming, even after controlling for actual error.

%\emph{2. Why is that knowledge important for the field?}

%So far there is very little research on what data people consider worthy of sharing with others and why.  This research shows that people perceive disconfirming data to be caused by error, and in turn, should not be shared with others.  These results establish some of the social-cognitive context of scientific reasoning.  Furthermore, the results suggest that scientists may need training to help them develop effective data sharing policies when they get disconfirming results, and that penalties for inaccurate judgments can increase their ability to identify error.  Even with these penalties, training is needed to help scientists understand the effect of feedback on their data sharing policies.

%\emph{3. How are the claims made in the article justified by the methods used?}

%The Wason 2-4-6 rule discovery task is the classic paradigm for studying the positive test strategy.  It provides a simulation of hypothesis testing that is simple and reflects the core features of hypothesis testing and data sharing, with affirming or disconfirming feedback, the use of logical reasoning and insight, and probabilistic reasoning when there is the possibility of error.  The data sharing task provides a simple method of measuring data sharing, and all measures were verified with open-ended comprehension checks.

%The main text counts, as do notes (or footnotes), acknowledgments, and appendices. The abstract, references, material in tables, and figure legends do not count. 
%At most 40 references and three figures/tables.

\newpage

\maketitle
Hypothesis testing has been found to follow a \emph{positive test strategy}.  Researchers collect data that they expect to conform to their prior beliefs and then exaggerate its information value \cite{klayman1987confirmation}, while discounting any inconsistent evidence that comes their way \cite{dunbar1995scientists,dunbar2001scientific,gorman2005scientific,lord1979biased}. This result has been found with both simple experimental tasks and in dynamic artificial environments, such as simulated molecular biology \cite{dunbar1993concept}, programmed robots \cite{klahr1988dual}, and multiple-cue probability learning \cite{o1989effects}.  Similar patterns have been found in scientific laboratories.  For example, in an observational study of a biological sciences laboratory, \citeA{dunbar2001scientific} found that scientists did not immediately reject their hypotheses after they were contradicted by data.  Rather, their first reaction was to invoke experimental error \cite{dunbar1995scientists,dunbar2001scientific,gorman2005scientific}.  In thirty-seven experimental treatments conducted by one biologist, twenty-one had unexpected results, most of which were treated as errors \cite{dunbar2001scientific}.

After inspecting their data for errors, researchers must decide whether to communicate any data that they consider flawed.  If the researchers' attributions to error are accurate, then omitting these errors from published reports may avoid distracting readers.  If they are inaccurate, then failing to publish those data will allow false theories to emerge and persist.  Justified or not, data attributed to error are unlikely to be shared.  For example, statistical significance is often (incorrectly) interpreted as the probability of error in data, and is usually a necessary condition for publication \cite<e.g.,>{fanelli2012negative,sterling1959publication}.

Data sharing decisions are affected not only by whether the data are perceived to be faulty, but also by professional rewards for publishing positive (usually statistically significant) results \cite{angell2000academic,bodenheimer2000uneasy,nathan1999academia,weatherall2000academia}.  These rewards can produce a healthy motivation to make a discovery, such as finding a successful anti-cancer drug.  However, rewards may also undermine accurate data inspection by increasing scrutiny of results that indicate the discovery is false, while simultaneously making affirming results a welcome relief from the pressure to produce \cite{kunda1990case}.  Supporting this account, there is evidence that higher rewards for publishing are associated with publication bias \cite{ioannidis2005early,fanelli2010pressures}.

Here we present four experiments on decisions to share possibly faulty data.  We use Wason's 2-4-6 rule-discovery task \cite{wason1960failure}.  It asks participants to discover the rule that generated a set of numbers (2, 4, 6), by proposing a new set of three numbers (a proposed triple), then getting feedback as to whether the numbers that they proposed fit the rule.  We use \citeA{penner1996trust}'s version, in which participants are told that some percentage of the time, the feedback will be false---a feature that adds something like the uncertainty that is inevitable with scientific inferences. 

We add several new features to the task.  (a) Before receiving feedback, participants assess the probability that it will affirm their expectations.  (b) After receiving it, they indicate whether they would share each trial, including the feedback, with a second person trying to discover the same rule. In Experiment One, the sharing decision is made at the end of the task.  In Experiment Two, it is made immediately after participants make their error judgments.  In Experiments Three and Four, it is made both after each trial and at the end of the task.  (c) We also use two types of incentives intended to simulate the rewards that may lead to motivated reasoning.  Experiment Two provides participants with a large incentive (\$100) for correctly guessing the rule, and a small incentive (\$1) for concluding that they do not know the answer.  Experiments Three and Four provide participants with an incentive to convince a matched participant that they discovered the rule, whether or not they actually did.

Using this task, we first replicate the finding that error is more likely to be invoked with disconfirming than with affirming feedback \cite{penner1996trust,gorman1986possibility,gorman1989error}.  We then examine whether these attributions to error are justified using two evaluative criteria: (a)  \emph{accuracy}, defined as whether the judgments are correct; and (b) \emph{Bayesian consistency}, defined as attributing feedback to error if and only if either: (i) participants strongly expected the triple they proposed to fit the rule but it did not, or (ii) participants strongly expected the triple they proposed to not fit the rule but it did.\footnote{More precisely, using Bayes' Rule it can be shown that feedback should be attributed to error whenever one believes that there is greater than an 80\% prior probability that the triple fit the rule, but the feedback indicates it does not fit, or conversely one believes there is less than a 20\% prior probability that the triple fit the rule, but the feedback indicates that it does fit.}  \emph{Selective reporting} is the degree to which trials attributed to error are not shared with the matched participant, compared to those attributed to other sources.  
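
The thresholds in the footnote can be recovered with a short calculation.  As a sketch, write $p = P(TFTR)$ for the prior probability that the proposed triple fits the rule, let $E$ denote the event that the feedback is false, and assume, as the task instructions state, that feedback is false with probability $0.2$ regardless of whether the triple fits.  The symbols $p$ and $E$, and the decision rule of attributing feedback to error when its posterior probability of being false exceeds one half, are notation and assumptions introduced here for illustration.  For feedback that the triple does not fit (DNF),
\begin{align*}
P(E \mid \mathrm{DNF}) &= \frac{0.2\,p}{0.2\,p + 0.8\,(1-p)} > \frac{1}{2}
\;\Longleftrightarrow\; 0.2\,p > 0.8\,(1-p)
\;\Longleftrightarrow\; p > 0.8,
\end{align*}
and for feedback that the triple fits (FIT),
\begin{align*}
P(E \mid \mathrm{FIT}) &= \frac{0.2\,(1-p)}{0.2\,(1-p) + 0.8\,p} > \frac{1}{2}
\;\Longleftrightarrow\; 0.2\,(1-p) > 0.8\,p
\;\Longleftrightarrow\; p < 0.2.
\end{align*}
Under these assumptions, error is the more probable explanation of the feedback only when the prior was at least as extreme as the 20\% error rate itself.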

\section{Experiment One}
Experiment One looks at whether attributions to error are consistent with prior beliefs, whether they correspond to actual error, and whether trials are less likely to be shared with another person when the feedback is attributed to error.  We used the Wason 2-4-6 rule discovery task with feedback error \cite{penner1996trust}.  

\subsection{Method}
\subsubsection{Participants}
Eighteen Carnegie Mellon University undergraduates completed the task for course credit.
%\footnote{A random-effects meta-analysis of the effect of disconfirming feedback on attributions to error from Penner and Klahr's Study One (numerical broad and narrow conditions), Penner and Klahr's Study Two (numerical/narrow) \cite{penner1996trust} and Gorman \cite{gorman1986possibility}, indicated an overall effect size of Hedges' $G=0.43$, 95\% CI [0.33, 0.53].  The sample sizes of the control groups in these studies were 15, 15, 25, and 24, respectively, indicating eighteen participants should provide sufficient power to detect the effect of feedback on error attribution.}  They were on average 21 years old (range: 18 -- 38).  There were 7 women.  One participant gave no valid responses. 

\subsubsection{Procedure}
Participants were seated at a computer, asked to sign an informed consent document, and then instructed that they had 30 minutes to complete the task.  Participants completed the task online as a Qualtrics questionnaire with embedded JavaScript used for feedback.  Each page (trial) of the questionnaire had the same format.  In order, participants proposed a rule, proposed a new triple, assessed the probability that the triple they proposed fit the Actual Rule, received feedback, judged whether the feedback reflected error, and then decided if they wanted to give their Final Answer.  They were reminded to record all responses both on the computer and on the spreadsheet they were given.  After participants decided to stop new trials and give their Final Answer, or thirty minutes had passed, they were asked to review their spreadsheet and mark the trials that they thought should be shared, in order to help a new participant solve the problem.

\subsubsection{Materials}
The materials were a modification of \citeA{penner1996trust}'s version of the Wason 2-4-6 rule discovery task (see \url{http://hdl.handle.net/1902.1/18699} for full materials).

%, with the study introduction rewritten to increase readability and comprehension, based on pretesting using cognitive interviews.  The study also used computerized, rather than hand-written, feedback.

\subsubsection{Introduction}

Participants were shown the following introduction on the computer along with a separate paper copy as a reminder:

\begin{quote}
``You will be given three numbers that are related somehow. For example: 3, 5, and 15. This is called a triple. There are many possible rules that could relate these three numbers. We have selected only one of them. The rule that we selected is called the Actual Rule. You will not be given the Actual Rule. Your task is to discover it. The initial triple on the next page is an example drawn from the Actual Rule.''

``Our study is using several versions of this task.  Yours is a particularly difficult one.  Sometimes, even if your Proposed Triple FITs the Actual Rule, the computer may output that it DOES NOT FIT. Conversely, sometimes, when your Proposed Triple DOES NOT FIT the rule, the computer may output that it FITs. On any trial there is a 20\% chance that you will get false feedback. For each trial if you think false feedback occurred mark `F' in the `Feedback' column on your spreadsheet. If you think true feedback occurred, mark `T' in the `Feedback' column.''

``At any time you may try to guess the Actual Rule that we selected. This is called the Final Answer. You only get one Final Answer and it may be wrong. Once you make your Final Answer you can no longer get feedback from the computer and the experiment will end.''
\end{quote}

\subsubsection{Initial Triple}
At the top of each page, the initial triple (2,4,6) was shown.  Participants were told:
\begin{quote}
``The initial triple above is an example drawn from the Actual Rule.''
\end{quote}

\subsubsection{Proposed Triple}
After writing their best explanation of the initial triple, they were instructed to propose a new triple:
\begin{quote}
``You may propose additional triples to help you discover the Actual Rule. The computer will tell you whether the triple you proposed fits the Actual Rule. Record all information on the spreadsheet you were given.  Write one number of your triple in each box below.''
\end{quote}

\subsubsection{Prior Probability}

On each trial, before they received feedback, participants assigned a probability that the triple they proposed fit the actual rule, by answering the following question:

\begin{quote}
``What is the probability that the triple you proposed fits the Actual Rule? (must be a number between 0 and 100)'' 
\end{quote}

\begin{flushleft}
We denote this $P(TFTR)$ for `[P]robability that the [T]riple [F]its [T]he actual [R]ule'.
\end{flushleft}

\subsubsection{Attribution to Error}

Immediately after receiving feedback that the triple fit (FIT) or did not fit (DNF) the rule, participants judged whether they thought the feedback was due to error:

\begin{quote}
``Do you think this feedback was true or false? (True/False)''
\end{quote}

\subsubsection{Final Answer}
After participants felt they had completed enough trials, or the 30-minute window expired, they were asked to make their Final Answer:
\begin{quote}
``Write your Final Answer for the Actual Rule in the box below (it can be mathematical or in words).''
\end{quote}

\subsubsection{Data sharing}

Participants then decided which trials they wanted to share with a new participant:

\begin{quote}
``In this experiment, a trial is a page where you proposed a rule, a triple, a probability estimate, received feedback, and judged whether you thought the feedback was false or true.'' 

``In a future experiment we will have a new participant try to discover the same rule you tried to discover. 

You can choose trials that you think will help him or her solve the rule. 
For each trial you indicate, all of the information would be shared, including: 
\begin{enumerate}
\item your rule
\item the proposed triple
\item your probability estimate
\item the feedback
\item whether you thought the feedback was false or true
\end{enumerate}

In the space below, please indicate the trials you conducted that you think would help this person.'' 
\end{quote}

\subsection{Results}
Unless otherwise noted, all estimation was done using hierarchical logistic models with subject-level varying intercepts \cite{gelman2007data,gelman2010arm}.  The model assumes that multiple observations from the same person are conditionally independent given the subject-specific intercept.  Tests, standard errors, and $p$-values based on these models were calculated using a non-parametric bootstrap with 200 simulations per statistic \cite{efron1993introduction}.
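
As a sketch of this kind of model, in notation introduced here for illustration rather than taken from the cited sources, let $y_{ij}$ be the binary response (for example, an error attribution or a sharing decision) of participant $j$ on trial $i$, and let $x_{ij}$ collect the trial-level predictors.  A varying-intercept logistic model can then be written as
\begin{align*}
\Pr(y_{ij} = 1) &= \operatorname{logit}^{-1}\!\left(\alpha_{j} + x_{ij}^{\top}\beta\right), \\
\alpha_{j} &\sim N\!\left(\mu_{\alpha}, \sigma_{\alpha}^{2}\right),
\end{align*}
with responses from the same participant treated as independent given $\alpha_{j}$.  For a statistic $\hat{\theta}$ recomputed on $B = 200$ bootstrap resamples as $\hat{\theta}^{*}_{1}, \ldots, \hat{\theta}^{*}_{B}$, the usual bootstrap standard error is
\begin{align*}
\widehat{SE}_{\mathrm{boot}} &= \sqrt{\frac{1}{B-1} \sum_{b=1}^{B} \left(\hat{\theta}^{*}_{b} - \bar{\theta}^{*}\right)^{2}},
\qquad \bar{\theta}^{*} = \frac{1}{B}\sum_{b=1}^{B} \hat{\theta}^{*}_{b}.
\end{align*}
How the resamples were drawn (for example, by participant or by trial) is not specified here, so the expression should be read as the generic form only.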

\subsubsection{Performance}
Participants completed a median of eight trials.  Each participant's task performance score was determined by their final answer, scored on a 5-point scale awarding one point for each element of the rule that they had discovered.  The five elements were: (1) even numbers, (2) consecutive numbers, (3) ascending numbers, (4) the lower bound is 2, and (5) the upper bound is 100.  
%All six participants who scored zero used a mathematical formula that was either unspecific (e.g., $x+2$) or not a rule (e.g., $(2+6)/2=4$).  Among the seven participants with a score of 1, six included even numbers in their answer, and one mentioned ascending numbers.  Of the four participants who scored 2 on the task, two mentioned consecutive even numbers, and two mentioned ascending evens.  The one participant who scored a 3 on the task guessed sequential even numbers less than 100. 

\subsubsection{Attributions to Error}
Replicating \citeA{penner1996trust}, participants judged disconfirming feedback to be error more often (38\%; $SE=$ 5.9\%) than affirming feedback (8.6\%; $SE=$ 4.4\%), $t(162)=3.77$, $p<0.05$, $d=$ 0.30.  In multiple regression, there was only a main effect of feedback type on attributions of error ($t(159)=$ 3.1, $p=$ 0.0075), with no significant main effect of actual error ($t(159)=$ 1, $p=$ 0.46) or interaction between the two factors ($t(159)=$ 0.72, $p=$ 0.62).  

\subsubsection{Bayesian Consistency}
%\begin{figure}[h] \pause
%    \centering
%\scalebox{1}{\includegraphics[\textwidth]{was1cons}}
%\caption{Proportion of trials attributed to error depending on whether Bayes' Rule dictates participants should attribute the trial to error and whether the feedback was affirming or disconfirming.}
%\label{fig:was1fig}
%\end{figure}

These error attributions were consistent with prior beliefs.  When participants received disconfirming feedback, they correctly attributed 11 of 12 trials to error when they strongly expected the triple to fit the rule beforehand ($P(TFTR)>0.8$), and incorrectly attributed 15 of 65 trials to error when the strength of their prior beliefs did not justify attributing the feedback to error ($P(TFTR)<0.8$), $\chi^{2}(1) = 23$, $p<0.05$, $\phi = 0.5$.  When receiving affirming feedback, they correctly attributed 1 of 4 trials to error when they strongly expected the triple to not fit the rule beforehand ($P(TFTR)<0.2$), and incorrectly attributed 4 of 63 trials to error when the strength of their prior beliefs did not justify attributing the feedback to error ($P(TFTR)>0.2$), $\chi^{2}(1)=$ 1, $p=$ 0.31, $\phi = 0.12$.  However, in multiple regression, there was a main effect of feedback type on attributions of error ($t(159)=$ 3, $p=$ 0.0095), no significant main effect of Bayes' Rule requiring error attribution ($t(159)=$ 1.2, $p=$ 0.37), and no interaction between the two factors ($t(159)=$ 1.4, $p=$ 0.3).  The overall correlation between their judgments and the consistency criterion was $\phi=$ 0.38, $\chi^{2}(1)=$ 23, $p<0.05$.

\subsubsection{Accuracy}
Although error attributions were consistent with prior beliefs, they did not match actual error.  When participants believed that feedback was false, it was as likely to be accurate as inaccurate (23\% vs. 29\%), $\chi^{2}(1)=$ 2.6, $p=$ 0.35, $\phi=$ 0.11.

\subsubsection{Data Sharing}
Participants were as likely to share data when feedback affirmed their hypothesis as when it did not (40\%; $SE=$ 11\% vs. 34\%; $SE=$ 12\%), $t(122)=$ 0.48, $p>$ 0.05.  They were also equally likely to share feedback when they saw it as accurate or inaccurate (40\%, $SE=$ 10\% vs. 32\%, $SE=$ 10\%), $t(122)=$ 0.53, $p>$ 0.05.  When including both main effects and the interaction between actual error and attribution of error to predict whether each trial would be shared, there was no significant main effect of error attribution ($t(119)=$ 0.55, $p=$ 0.68), no significant main effect of actual error ($t(119)=$ 0.045, $p=$ 0.8), and no significant interaction between the two factors ($t(119)=$ 0.19, $p=$ 0.78). 

\subsection{Discussion}
The results replicate the findings of \citeA{gorman1989error} and \citeA{penner1996trust}, who found that people are more likely to question feedback when it disconfirms their hypothesis.  Most of these error judgments were normatively justified, matching the Bayesian consistency criterion on 92 of 144 trials (64\%).  In spite of this consistency, participants were unable to identify actual error.  Finally, on a task new to this study, participants shared information at equal rates regardless of whether the feedback was affirming or disconfirming and regardless of whether it was attributed to error.

The positive test strategy \cite{klayman1987confirmation} entails seeking affirming evidence and discounting disconfirming evidence.  Experiment One found that this strategy is both internally consistent and inaccurate.  Participants were, however, no less likely to share disconfirming or seemingly flawed data.  Although this pattern of data sharing contradicts the positive test strategy, we observed that some participants had difficulty interpreting the open-ended data-sharing question.  Namely, when asked which trials they wanted to share, some responded with a triple (e.g., ``2, 4, 6''), rather than a trial (e.g., ``trial 3'').  Experiment Two addresses this problem by using a fixed response format after each trial rather than an open-ended one at the end.

\section{Experiment Two}
Experiment One replicated the positive test strategy found in previous studies, with participants invoking error more often for disconfirming feedback.  These attributions to error were justified in terms of the consistency criterion, but not the accuracy criterion.  The second experiment examines the effects of incentives on these judgments, using a monetary payoff that encourages participants to convince themselves that they know the rule.  Specifically, participants were offered \$100 for guessing the Actual Rule correctly, and \$1 for concluding that they do not know it.  This incentive scheme sought to encourage motivated reasoning, so that hopeful participants would believe that they have reached a correct conclusion rather than assess their knowledge candidly.  
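
One way to see what this payoff scheme asks of a participant is a simple expected-value comparison.  As an illustrative sketch only, assume a risk-neutral participant who believes with probability $q$ that their guess is exactly correct; both $q$ and the assumption of risk neutrality are introduced here and are not part of the experimental design.  Guessing then has the higher expected payoff whenever
\begin{align*}
100\,q + 0\,(1-q) > 1 \;\Longleftrightarrow\; q > 0.01.
\end{align*}
Under these assumptions, even a participant with very little confidence in their rule has a financial reason to guess rather than to report that they do not know.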

Experiment Two also improves the data-sharing decision.  Immediately after receiving feedback, participants make a binary (Yes--No) decision about whether each trial should be shared.  Placing the data-sharing decision at the end of each trial rather than at the end of the experiment was intended to make it clearer that the sharing decision applies to the current trial, resolving any ambiguity about whether triples or trials should be shared.  It also elicits sharing judgments earlier in the task, before participants might become tired or frustrated.

\subsection{Method}
\subsubsection{Participants} 
Fifty-eight Carnegie Mellon University undergraduates participated in the experiment for course credit. There were thirty-four women, with an average age of 20 years (range: 18 -- 24).

\subsubsection{Design}
Participants were randomly assigned to either the control or the incentive condition.  This was a one-way between-subjects design with two levels. 

\subsubsection{Procedure}
The entire experiment lasted 30 minutes.  Participants were given an informed consent document, instructions, and the response sheet. At the end of the experiment, they were asked to leave their email address, with the promise that anyone who had solved the rule would be contacted later to receive their bonus payment.  Payment of the bonus was delayed to prevent participants from telling their friends the correct answer.

\subsubsection{Materials} 
All materials were the same as those in Experiment One except for the following three changes.  First, participants were given the spreadsheet, but were not required to use it.

Second, in the financial incentive condition, participants were told: 

\begin{quote}
``At the end of the experiment you will be given a chance to win money by guessing the rule. If you decide to guess the rule you will receive 100 dollars if the guess is exactly correct, but 0 dollars if the guess is incorrect. On the other hand, you can decide that you do not know and receive 1 dollar for sure.''
\end{quote}

Third, decisions to share a trial were made immediately after participants made their attributions to error:  

\begin{quote}
``We are also interested in how people share information. In a future experiment, a new participant will try to discover the same Actual Rule that you are trying to discover. You can share information with this new participant to help him or her solve the Actual Rule.  If the new participant solves the rule, you will receive an additional 50 dollars.'' 
\end{quote}

\begin{flushleft}
The trials were described in the same way as in Experiment One, but the sharing judgment was now binary:
\end{flushleft}
\begin{quote}
``Do you think this trial should be shared with a new participant? (Yes/No)''
\end{quote}

\subsection{Results}


\subsubsection{Incentives and Performance}

Incentives doubled the median number of trials from 4.5 to 9.\footnote{Although the median number of trials increased, a non-parametric Kolmogorov-Smirnov (KS) test for differences in empirical cumulative distributions indicates no differences in distribution.  Between Experiment One and the control condition of Experiment Two, the KS test was $D=$ 0.31, $p=$ 0.23.  Between Experiment Two control and incentive conditions, the KS test was $D=$ 0.32, $p=$ 0.11.  Thus, although the medians were different, the distributions of trials between the studies and conditions were similar.}  Using the same scoring method as Experiment One, those in the incentive condition scored about the same on average ($M=1.58$, $SD=1.21$) as those in the control condition ($M=1.66$, $SD=1.21$), $t(56)=$ 0.80, $p>$ 0.05.  One participant solved the rule exactly, and was compensated with a \$99 Amazon gift card.

\subsubsection{Attributions to Error}
As in Experiment One, those in the control condition were significantly more likely to see feedback as error when it was disconfirming (29\%, $SE=$ 5.8\%) than when it was affirming (10\%, $SE=$ 4.1\%), $t(171)=$ 2.89, $p<$ 0.05, $d=$ 0.22.  In contrast, participants in the incentive condition were equally likely to attribute error to disconfirming feedback (16\%, $SE=$ 3.7\%) and to affirming feedback (20\%, $SE=$ 6.7\%), $t(309)=$ 0.62, $p>$ 0.05, $d=$ 0.04.  Thus, although we expected the incentives to increase motivated reasoning, they appeared to reduce the tendency for participants to attribute disconfirming results to error.  In multiple regression, there were significant main effects of feedback type ($t(476)=$ 2.5, $p=$ 0.038) and incentive ($t(476)=$ 2.4, $p=$ 0.05), and a significant interaction between the two factors, such that disconfirming feedback increased error attributions only for those in the control condition ($t(476)=$ 3.1, $p=$ 0.0069).  There were no other main effects or two-way or three-way interactions among feedback type, incentive condition, and actual error.\footnote{This reversal of error attributions may be partially explained by the implied difficulty of the rule.  Participants in the incentive condition expected their triples to fit the rule less often ($M=$ 0.41, $SD=$ 0.36) than those in the control condition ($M=$ 0.49, $SD=$ 0.3), $t(482)=-2.4$, $p=$ 0.021.}

\subsubsection{Bayesian Consistency}


%\begin{figure}[h] \pause
%    \centering
%\scalebox{1}{\includegraphics[\textwidth]{was2cons}}
%\caption[Experiment Two Trials Attributed to Error Compared to Bayes' Rule]{Proportion of trials attributed to error depending on whether Bayes' Rule predicted error attribution and whether the feedback was affirming or disconfirming.}
%\label{fig:was2fig}
%\end{figure}
%As seen in Figure~\ref{fig:was2fig}, 
Attributions to error for participants in the control condition were consistent with their prior beliefs.  For affirming feedback, they correctly attributed 3 of 10 trials to error and incorrectly attributed 0 of 47 trials to error, $\chi^{2}(1) = 3.9$, $p=$ 0.057, $\phi=$ 0.24.\footnote{A hierarchical model could not be used for the control condition.  Only one participant both made an attribution to error and should not have made one.  Thus, only one subject-level intercept could be fit, as all other participants had zero probability of judging error.  To deal with this, we pooled all of the data to obtain an approximate answer.}  For disconfirming feedback they correctly attributed 9 of 13 trials to error and incorrectly attributed 16 of 82 trials to error, $\chi^{2}(1) = 10$, $p<0.05$, $\phi = 0.32$.  The overall correlation between their attributions to error and the consistency criterion was $\phi=$ 0.24, $\chi^{2}(1)=$ 10, $p=$ 0.0015.

Participants in the incentive condition exhibited similar consistency.  For affirming feedback, they correctly attributed 12 of 38 trials to error and incorrectly attributed 15 of 76 trials to error, $\chi^{2}(1)=$ 2.9, $p=$ 0.088, $\phi = 0.16$.  For disconfirming feedback they correctly attributed 11 of 31 trials to error and incorrectly attributed 22 of 155 trials to error, $\chi^{2}(1) = 16$, $p<0.05$, $\phi = 0.29$.  The overall correlation between their error attributions and the consistency criterion was $\phi = 0.19$, $\chi^{2}(1) = 12$, $p=0.001$.

\subsubsection{Accuracy}

As in Experiment One, participants in the control condition were unable to identify when actual errors occurred.  They correctly identified 26\% of actual errors and incorrectly identified 20\% of non-errors as error, $\chi^{2}(1)=$ 2.2, $p=$ 0.37, $\phi=$ 0.093.  For the incentive condition, participants were also unable to identify actual error.  They correctly identified 21\% of actual errors and incorrectly identified 21\% of non-errors as error, $\chi^{2}(1)=$ 1.3, $p=$ 0.44, $\phi=$ 0.056.

\subsubsection{Data Sharing}

In contrast to Experiment One, participants in the control condition shared a smaller proportion of trials when the feedback was disconfirming (84\%, $SE=$ 8.5\%) than when it was affirming (93\%, $SE=$ 5.5\%), $t(171)=$ 1.96, $p=$ 0.05, $d=$ 0.15.  Similarly, they shared a smaller proportion of trials when they judged the feedback to be an error (79\%, $SE=$ 12\%) than when they judged it to be accurate (91\%, $SE=$ 6.1\%), $t(171)=$ 1.98, $p<$ 0.05, $d=$ 0.15.  Participants in the incentive condition also shared a smaller proportion of trials when the feedback was disconfirming (84\%, $SE=$ 6.2\%) than when it was affirming (94\%, $SE=$ 3.7\%), $t(309)=$ 2.95, $p<$ 0.05, $d=$ 0.17.  They also shared a smaller proportion of trials when they judged the feedback to be an error (71\%, $SE=$ 11\%) than when they judged the feedback to be accurate (91\%, $SE=$ 3.8\%), $t(309)=$ 3.94, $p<$ 0.05, $d=$ 0.22.  

In multiple regression, there was only a significant main effect of error attribution ($t(476)=$ 2, $p=$ 0.053), and a marginally significant interaction between actual error and incentive condition, such that those in the incentive condition were more likely to share actual errors than those in the control condition ($t(476)=$ 1.7, $p=$ 0.086).  There were no other main effects, two-way, or three-way interactions between feedback type, actual error, and incentive condition. 

\subsection{Discussion}

Experiment Two again found that participants more often attribute error to disconfirming feedback when given no incentive beyond their intrinsic motivation to solve the problem.  However, participants who were offered a large incentive for getting the rule attributed error to affirming and disconfirming feedback at equal rates.  Although we had expected the incentive for getting the rule to increase motivated reasoning, it actually reduced the tendency for participants to attribute disconfirming feedback to error.  It did not, however, lead to error attributions that were either more accurate or more consistent with prior expectations.  Participants in the control condition met the consistency criterion on 125 of 152 trials (82\%), which was a higher rate than those in the incentive condition (217 of 300 trials, 72\%).  One possible explanation is that the incentive helped participants maintain a more balanced perspective on the likelihood of error after receiving feedback; however, in spite of their motivation, they lacked the understanding (e.g., of Bayes' Rule) needed to respond consistently. An alternative explanation is that participants in the incentive condition rushed through the prior probability and error attribution questions in order to complete more trials, thereby creating more chances to propose triples and get feedback.  This strategy would reduce consistency and make attributions of error more equal across feedback types, and is consistent with the finding that participants in the incentive condition completed twice as many trials in the same time period as those in the control condition.

For both the control and the incentive groups, participants shared disconfirming feedback less frequently than affirming feedback.  They also shared feedback that they attributed to error less frequently than feedback that they saw as accurate.  Those error attributions were loosely justified by internal consistency, but not by accuracy.  Extrapolating to scientific contexts, researchers may have defensible reasons to omit data from publication based on their expectations, but this consistency may not prevent harm to those who must use the data.  Before reaching that conclusion, we address one possible artifact in Experiment Two's procedure: the sharing decision was placed immediately after the error attribution task, perhaps suggesting that the two should be related.  Experiment Three addresses this possible confound by eliciting data sharing decisions and error attributions both during each trial and at the end of the task, which also allows participants to reflect on all the data before making their final error attributions and data-sharing decisions.

Finally, Experiment Two's incentive scheme sought to motivate participants to believe they knew the rule.  However, the value of data is usually determined not by the person who collects them, but by others, such as reviewers (for journals) or regulatory bodies (for drug approval).  These people, who are external to the data collection process, determine the reward to the researcher based on their prior beliefs and their evaluation of the data shared with them.  To simulate this incentive system more closely, Experiment Three uses the natural expectations that participants have about how to convince another person.  We expect that an incentive to convince another person should increase the preference for discounting disconfirming feedback.

\section{Experiment Three}

Experiment Three replicates Experiment Two with several modifications.  Most importantly, a new condition provides an incentive for participants to convince another person that their proposed Final Answer is correct, with data-sharing as the sole mode of communication between them.  To do this, we embed the Wason task in a teacher--learner game, a type of principal--agent game \cite{fudenberg1991game,shaftoepistemic}.  In this task, the participant collecting the data (the teacher) shares data with another person (the learner) who has to guess the rule based on the data that the teacher decides to share.  

The teacher is in one of two incentive conditions.  The \emph{compatible} incentive condition rewards both the teacher and learner if the learner guesses the rule.  In the \emph{perverse} incentive condition, the learner's rewards remain the same, but the teacher receives money if the learner accepts the teacher's Final Answer.  Thus, the perverse incentive allows the teacher to distort the data supplied to the learner, potentially increasing her own payoff while reducing the learner's reward.  In this scenario, the teacher knows the entire game structure, but the learner does not.  Specifically, the learner is not told that the teacher does not have to share all the trials that were conducted, and the teacher is told that the learner only knows about the shared trials.

Experiment Three also deals with two methodological issues brought up in Experiment Two.  One is that participants in the incentive condition attributed affirmation and disconfirmation to error equally, but were slightly less consistent in their attributions to error than participants in the control condition.  This may have reflected their rushing through the task to complete more trials.  To reduce this threat, we use a penalty for making incorrect prior probability and error attributions.  Any payoff to the participant is reduced in proportion to their inaccuracy on these two measures.  This penalty prevents them from performing one element of the task well (collecting many trials) at the cost of the other elements (rushing through attributions to error).  The second was the possibility that participants assumed that the data sharing and error attribution judgments should be related because they occurred sequentially on each trial.  This could create a false correlation between the two measures based on the participant's belief that the experimenter put the two questions close to each other for a reason.  To deal with this, we also elicit data sharing decisions and attributions to error at the end of the task, using a fixed-response format rather than the open-ended format used in Experiment One.

\subsection{Method}


\subsubsection{Participants}  
One hundred Amazon Mturk volunteers completed the task for \$5. There were 46 women, with an average age of 32 years (range: 18--65).

\subsubsection{Design}
The design was a 2 level (perverse or compatible incentive) between-subjects design.

\subsubsection{Materials}
The procedure and materials were the same as in Experiment Two except for the following modifications.  First, participants completed three `practice trials' to help them understand the task.  They were then told the following: 

\begin{quote}
 ``We are also interested in how people share information.  The information comes in trials.  A trial is a page where you proposed a triple and received feedback.  The practice trials you conducted are shown below.  For each trial you share, another person will get the triple you proposed and the feedback you received.  The person will also receive the Final Answer you propose at the end of the task, regardless of the trials you share.''
\end{quote}

\begin{flushleft}
Participants were then told about possible bonus money:
\end{flushleft}

\begin{quote}
``Both you and the person you share trials with can earn up to a \$5 bonus in addition to the \$5 you receive for participating in the experiment.'' 
\end{quote}

\begin{flushleft}
The perverse incentive condition was followed with this text:
\end{flushleft}
\begin{quote}
``How you earn bonus money:
\begin{itemize}
  \item If the other person thinks your Final Answer matches the Actual Rule exactly, then you get \$5.
  \item If the other person thinks your Final Answer does not match the Actual Rule at all, then you get \$0.
  \item If the other person thinks your Final Answer somewhat matches the Actual Rule, then you get somewhere between \$0 and \$5.''
\end{itemize}
\end{quote}
\begin{quote}
``How the person you are sharing trials with earns bonus money: \newline
The person you are sharing trials with can also earn money.
\begin{itemize}
  \item This person gets the most money (\$5) by correctly judging how well your Final Answer matches the Actual Rule.
  \item If this person thinks your Final Answer matches the Actual Rule, but it does not, the other person gets less money. 
  \item If this person thinks your Final Answer does not match the Actual Rule, but it is does, the other person gets less money.''
    \end{itemize}
\end{quote}

\begin{flushleft}
Those in the compatible incentive condition were told:
\end{flushleft}

\begin{quote}
  \begin{itemize}
  \item ``If the other person's guess matches the Actual Rule exactly, then you both get \$5.
  \item If the other person's guess does not match the Actual Rule at all, then you both get \$0.
  \item If the other person's guess somewhat matches the Actual Rule, then you both get somewhere between \$0 and \$5.''
  \end{itemize}
  \end{quote}
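
The two bonus schemes can be summarized in a minimal formal sketch.  The symbols below are introduced here for illustration only and do not appear in the materials: $a \in [0,1]$ is the learner's judgment of how well the teacher's Final Answer matches the Actual Rule, $m \in [0,1]$ is how well the learner's own guess matches the Actual Rule, and $f$ and $g$ are unspecified nondecreasing functions with $f(0)=g(0)=0$ and $f(1)=g(1)=1$.  Before any penalty is applied, the teacher's bonus (in dollars) is then
\begin{align*}
B_{\text{perverse}} &= 5\,f(a), & B_{\text{compatible}} &= 5\,g(m).
\end{align*}
The instructions state only that partial matches pay ``somewhere between \$0 and \$5,'' so the shapes of $f$ and $g$ are left open.  The key contrast is that $B_{\text{perverse}}$ depends on what the learner believes about the teacher's answer, whereas $B_{\text{compatible}}$ depends on whether the learner's guess is right.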

\begin{flushleft}
Finally, participants were told the penalty for making incorrect attributions:
\end{flushleft}
\begin{quote}
``Penalty for wrong answers

Any bonus you get will be reduced if your false feedback and probability judgments are wrong. Thus, to earn the most money you should make your false feedback and probability judgments as accurate as possible.''
\end{quote}

\subsection{Results}

\subsubsection{Incentives and Performance}

As in Experiment Two, participants in the compatible and perverse incentive conditions completed a median of about 8 trials (9 and 7, respectively), $t(97)=-0.12$, $p=$ 0.91, $d=-0.012$.  Using the same scoring method as before, those in the compatible incentive condition scored about the same ($M=$ 1.8, $SD=$ 1.2) as those in the perverse incentive condition ($M=$ 1.6, $SD=$ 1.1), $t(98)=$ 0.87, $p=$ 0.39.

\subsubsection{Attributions to Error}

Both during (38\% vs. 4.8\%) and at the end of the task (41\% vs. 12\%), those in the compatible incentive condition were more likely to see feedback as in error when it was disconfirming than when it was affirming ($t(510)=$ 7.7, $p<$ 0.001, $d=$ 0.34; $t(525)=$ 6.5, $p<$ 0.001, $d=$ 0.28, respectively).  Similarly, both during (39\% vs. 9.3\%) and at the end of the task (47\% vs. 7.9\%), those in the perverse incentive condition were significantly more likely to see feedback as in error when it was disconfirming than when it was affirming ($t(537)=$ 7.3, $p<$ 0.001, $d=$ 0.32; $t(504)=$ 8.5, $p<$ 0.001, $d=$ 0.38, respectively).

\subsubsection{Bayesian Consistency}

For both incentive groups, adding the penalty for incorrect error attributions and probability judgments greatly improved accuracy and consistency, as compared to Experiments One and Two.  For the compatible condition, the overall correlation between their attributions to error and the consistency criterion was $\phi=$ 0.37, $\chi^{2}(1)=$ 76, $p<$ 0.001.  Participants in the perverse incentive condition exhibited even greater consistency, $\phi=$ 0.55, $\chi^{2}(1)=$ 179, $p<$ 0.001.

%\footnote{For affirming feedback in the compatible condition, they correctly attributed 3 of 5 trials to error, and incorrectly attributed 8 of 258 trials to error, $\chi^{2}(1)=$ 3.9, $p<$ 0.057, $\phi=$ 0.24.  For disconfirming feedback they correctly attributed 88 of 241 trials to error, and incorrectly attributed 0 of 5 trials to error, $\chi^{2}(1)=$ 10, $p<0.05$, $\phi=$ 0.32.  For affirming feedback in the perverse condition, they correctly attributed 5 of 7 trials to error, and incorrectly attributed 20 of 271 trials to error, $\chi^{2}(1)=$ 2.9, $p=$ 0.088, $\phi=$ 0.16.  For disconfirming feedback they attributed 95 of 241 trials to error correctly, and incorrectly attributed 0 of 10 trials to error, $\chi^{2}(1) = 16$, $p<0.05$, $\phi=$ 0.29.}  

\subsubsection{Accuracy}
Participants both in the compatible and perverse incentive conditions were able to accurately identify error during the task ($\chi^{2}(1) = 79$, $p<$ 0.001, $\phi = 0.37$; $\chi^{2}(1)=$ 139, $p<$ 0.001, $\phi=$ 0.5, respectively).  Participants in the compatible incentive group correctly identified 44 of 99 actual errors and incorrectly identified 60 of 458 non-errors as error.  For the perverse incentive condition, participants correctly identified 61 of 121 actual errors and incorrectly identified 66 of 474 non-errors as error.  This accuracy also slightly improved in judgments made at the end of the task for both the compatible and perverse incentive conditions ($\chi^{2}(1) = 121$, $p<$ 0.001, $\phi = 0.45$; $\chi^{2}(1) = 167$, $p<$ 0.001, $\phi = 0.57$, respectively).
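
Expressed as proportions, which makes the during-task counts easier to compare with the percentages reported for Experiment Two, the rates of correctly and incorrectly identified error implied by these counts are
\begin{align*}
\text{Compatible:} &\quad 44/99 \approx 44\%, \quad 60/458 \approx 13\%, \\
\text{Perverse:} &\quad 61/121 \approx 50\%, \quad 66/474 \approx 14\%.
\end{align*}
These are simple ratios of the counts above, pooled over participants, and are offered only as a descriptive restatement.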

\subsubsection{Data Sharing}
Participants in the compatible incentive condition shared 147 of 211 trials when the feedback was disconfirming $(76\%, SE = 6.7\%)$ and 191 of 203 when it was affirming $(97\%, SE = 1.4\%)$, $t(499)$ = 5.8, $p<$ 0.001, $\emph{d} = 0.26$.  Similarly, they shared 37 of 82 trials when they attributed feedback to error $(53\%, SE = 12\%)$ and 369 of 400 when they judged it to be accurate $(97\%, SE = 1.4\%)$, $t(570)$ = 7.6, $p<$ 0.001, $\emph{d} = 0.32$.  

At the end of the task, participants shared 112 of 180 trials that they judged to be an error $(71\%, SE = 16\%)$ and 391 of 414 when they judged it to be accurate $(99\%, SE = 1.5\%)$, $t(515)$ = 2.5, $p=$ 0.037, $\emph{d} = 0.11$.  However, there was also significant variation across participants in how much data they shared when they perceived the feedback to be an error, $\chi^{2}(1)=$ 62, $p<$ 0.001.  As can be seen in Figure~\ref{fig:was3fig}, most participants in the compatible incentive condition shared all of the trials they attributed to error at the end of the task, while a significant proportion shared none of those trials.  However, there was no such variation for data sharing in response to disconfirming feedback $\chi^{2}(2)=$ 2, $p=$ 0.51, or error attributions during the task, $\chi^{2}(2)=$ 1.5, $p=$ 0.66.  


\begin{figure*}[h]
    \centering
\scalebox{0.9}{\includegraphics[width=\textwidth]{was3}}
\caption{Proportion of trials shared by whether the trial was disconfirming (top row), whether participants attributed that trial to error during the task (middle row), and whether participants attributed the trial to error at the end of the task (bottom row).}
\label{fig:was3fig}
\end{figure*}

Unexpectedly, participants in the perverse incentive condition did not share trials at lower rates than those in the compatible incentive condition.  They shared 171 of 218 trials when the feedback was disconfirming $(88\%, SE = 6.6\%)$ and 183 of 218 trials when it was affirming $(97\%, SE = 2.2\%)$, $t(520)$ = 1.8, $p<$ 0.17, $\emph{d} = 0.077$.  They also shared 68 of 102 trials when they judged the feedback to be an error during the task $(80\%, SE = 10\%)$ and 286 of 334 trials when they judged the feedback to be accurate $(96\%, SE = 2.8\%)$, $t(520)$ = 2.4, \emph{p} $<$ 0.046, $\emph{d} = 0.1$.  At the end of the task, they shared 74 of 130 trials that they judged to be an error $(81\%, SE = 20\%)$ and 333 of 354 trials that they judged to be accurate $(100\%, SE = 0.34\%)$, $t(481)$ = 2.1, $p=$ 0.08, $d=$ 0.098.


As seen in Figure~\ref{fig:was3fig}, there was significant variation across participants in their decisions to share data after receiving disconfirming feedback, $\chi^{2}(1)=$ 11, $p=$ 0.075, whether they shared data that they perceived to be error during the task, $\chi^{2}(1)=$ 9.1, $p=$ 0.03, and whether they shared data that they perceived to be error at the end of the task, $\chi^{2}(1)=$ 27, $p<$ 0.001.  For all three judgments, most participants in the perverse incentive condition shared all of their trials, with a minority sharing less.  

Our prediction was that some participants would be seduced by the perverse incentive, thus deciding only to share trials that were consistent with their final answer.  However, there was no difference between conditions in the probability of omitting data that were inconsistent with their final answer, $t(999)=$ 0.13, $p=$ 0.79.  A second way that participants could produce these results while exploiting the perverse incentive would be to seek out only affirming data, knowing the data would make a simple and convincing story.  One way to implement this weak testing strategy is to propose the (2,4,6) triple, knowing that they would receive affirming feedback unless the feedback is in error.  However, participants in the two incentive conditions were equally likely to propose (2,4,6) triples, $t(1154)=$ 0.59, $p=$ 0.67.

As participants were both accurate and consistent in their error attributions, they may have been able to remove actual errors from the data they shared.  Overall, at the end of the task participants shared 63 of 118 (53\%) trials that were both actual errors and perceived as errors, 62 of 65 (95\%) trials that were actual errors but not perceived as errors, 106 of 163 (65\%) trials that were perceived as errors but not actual errors, and 615 of 650 (95\%) trials that were neither perceived as error nor actual error.  When including both main effects and the interaction between actual error and attribution of error to predict whether each trial would be shared at the end of the task, there was only a significant main effect of error attribution, and not actual error, for both compatible and perverse conditions ($t(509)=$ 4.7, $p<$ 0.001 vs. $t(479)=$ 5.2, $p<$ 0.001, respectively).  This means that error attributions, but not actual errors, determine whether data are shared.
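
Pooling the four cells reported above gives a descriptive view of the same pattern.  These are raw proportions that ignore the clustering of trials within participants, so they are only a rough companion to the regression, not a substitute for it:
\begin{align*}
\text{perceived as error:} &\quad (63+106)/(118+163) = 169/281 \approx 60\%, \\
\text{not perceived as error:} &\quad (62+615)/(65+650) = 677/715 \approx 95\%, \\
\text{actual error:} &\quad (63+62)/(118+65) = 125/183 \approx 68\%, \\
\text{not actual error:} &\quad (106+615)/(163+650) = 721/813 \approx 89\%.
\end{align*}
Sharing differs by roughly 35 percentage points across perceived error, but by roughly 20 points across actual error.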

The reason perceived and actual errors diverged was that disconfirmation had a systematic and additive effect on perceived error, even after controlling for actual error.  Main effects of both actual error ($t(1025)=$ 7, $p<$ 0.001) and disconfirming feedback ($t(1025)=$ 7.3, $p<$ 0.001) increased the chance of attributing a trial to error at the end of the task, with no significant interaction between the two ($t(1025)=$ 1.6, $p=$ 0.23).  Thus, affirming trials were shared more often, as they were less likely to be perceived as errors than disconfirming trials even when they were actually errors, whereas disconfirming trials were shared less frequently because they were inappropriately seen as errors when they were not.

\subsection{Discussion}

Participants with a compatible or perverse incentive to share data were equally likely to attribute disconfirming feedback to error.  The financial penalty for making incorrect probability judgments and attributions to error produced greater consistency and accuracy, compared to Experiments One and Two.  Participants in both incentive conditions also shared fewer trials whose feedback was disconfirming or attributed to error, either during or at the end of the task.  Although participants were successful in identifying actual errors, it was attributions to error that determined whether they shared trials, indicating that being able to identify error does not preclude failing to share trials with accurate disconfirmations while sharing ones with inaccurate affirmations.

We expected the perverse incentive to reduce the consistency and accuracy of attributions to error, as well as to reduce the sharing of data attributed to error.  However, such motivated reasoning was not observed.  Rather, data sharing behavior in the two conditions differed in an unexpected way.  Both for decisions made after each trial and at the end of the task, participants in the perverse incentive condition shared \emph{more} data than those in the compatible incentive condition, thereby demonstrating a more ethical data sharing stance.  While it is possible that higher stakes, such as those involved in pharmaceutical or academic research, would lead to motivated reasoning and data sharing policies, participants responded to the moderate stakes used in this research with reasoned and ethical behavior.

In decisions made at the end of all trials, however, some participants in the perverse incentive condition decided to share none of the data they attributed to error.  Contrary to our prediction, these participants did not omit more trials that were inconsistent with their final answer than those in the compatible incentive condition.  Additionally, those in the perverse incentive condition did not try to produce a convincing story in as few trials as possible, in order to reduce the risk of collecting inconvenient data that would make their Final Answer less convincing or require selective reporting.

There are several possible explanations for why participants in the perverse incentive condition shared trials at a higher rate than those in the compatible incentive condition.  First, they may have thought that the learner knows they can hide data, even though the instructions indicated that the other participant would only know about the trials they decided to share.  Second, they may have believed that sharing more trials increases the learner's confidence, regardless of whether the trials are consistent with their Final Answer.  Third, they may have been more strongly motivated to do the right thing and give the learner all the data available, even if that came at the cost of their own compensation.  Open-ended responses at the end of the task show four examples of such motivation:

\begin{quote}
\begin{enumerate}
\item ``Yes. / It was an exercise in thinking about probabilities and cooperating with another. ''
\item ``I think there was some deception involved.  This experiment may be about how willing the participant is to share money.''
\item ``seems more like a trust then a math problem. ''
\item ``I shared everything because, not knowing if the FIT/DNF response by the computer was correct, I didn't want to deliberately bias the info I passed on by being selective.''
\end{enumerate}
\end{quote}


Thus, Experiment Three extends the positive test strategy to communication of results, seen in selective reporting, such that disconfirming data are seen as both caused by error and not worthy of sharing with others.  Contrary to our prediction of motivated reasoning \cite{kunda1990case}, the perverse incentive condition not only did not increase error attributions, but increased the sharing of data that were disconfirming or attributed to error.

\section{Experiment Four}


Experiment Three found that participants given a perverse incentive shared more trials that were disconfirming or attributed to error than those given a compatible incentive.  On the surface it appears that they were genuinely willing to benefit others, at potential financial cost to themselves.  To test this explanation, at the end of Experiment Four all participants were told the Actual Rule and then were allowed to modify the data they shared, but not adjust their Final Answer.  Thus, if participants given a perverse incentive care about ethics and altruism they should change the data they share to match the correct answer, even at the likely cost to their own payoff.  However, if other concerns determine their data sharing, such as uncertainty about whether trials were errors, then knowing the correct answer should allow them to share only trials that are consistent with their Final Answer.\footnote{To promote greater exploration of the hypothesis space and give more disconfirming feedback, a pre-test for Experiment Four added a condition to the rule: that only \emph{odd multiples of two} fit the rule.  Unfortunately, this did not increase the median number of trials completed (7).  All other results replicated.  Participants were more likely to see feedback as in error when it was disconfirming than when it was affirming ($t(322)=$ 5.3, $p<$ 0.001, $d=$ 0.29).  The remaining results were: Bayesian consistency, $\phi=$ 0.45, $\chi^{2}(1)=$ 110, $p<$ 0.001; accuracy, $\chi^{2}(1) = 48$, $p<$ 0.001, $\phi = 0.38$; data sharing after disconfirmation, $t(263)$ = 3.9, $p<$ 0.001, $\emph{d} = 0.24$; data sharing after error, $t(259)$ = 4.5, $p<$ 0.001, $\emph{d} = 0.28$; and data sharing at the end of the task, $t(373)$ = 4.8, $p=$ 0.001, $\emph{d} = 0.25$, with significant heterogeneity, $\chi^{2}(1)=$ 33, $p=$ 0.0082.}

Experiment Four also seeks to rule out two alternative explanations for the data sharing results of Experiment Three.  Participants may have thought that the person receiving the data knew they could hide trials, hence might become suspicious if the data were too orderly.  To clarify that this was not possible, we modified the wording to be clear that the other person cannot know that they collected more trials than they shared, if they decide to do so:

\begin{quote}
  ``We are also interested in how people share information.  The information comes in trials.  A trial is a page where you proposed a triple and received feedback.  For each trial you share, another person will get the triple you proposed and the feedback you received.  This person only knows about the trials you share.  This person does not know how many trials you completed or that you did not have to share all of your trials.''
 \end{quote}

Second, explicitly mentioning the other person may evoke concerns about the welfare of the person receiving the data.  Similarly, data sharing judgments were worded as ``information sharing'', possibly sending the message to participants in the perverse incentive condition that they should share, rather than hide, data.  Instead, some participants may wish to deceive the other person, but perceived that the `sharing' label provided by the experimenter meant that concern for the other person, rather than deception, is the expected behavior, introducing a demand characteristic \cite{orne1962social,weber1972subject,nichols2008good}.  To control this, we used more neutral language, where all mentions of `sharing' were changed to `communicate', and any mention of the other participant was removed, wherever possible.  For example, the trial-by-trial judgments were changed from:

\begin{quote}
  Do you think this trial should be shared with a new participant?
\end{quote}

to

\begin{quote}
Do you think this trial should be communicated?
\end{quote}

This wording was intended to let participants decide for themselves whether the data communication was about sharing or deception, rather than try to infer what the experimenter wanted.

\subsection{Method}


\subsubsection{Participants}  
One hundred twenty-three Amazon Mturk volunteers completed the task for \$5. There were 61 women, with an average age of 32 years (range: 18--69).

\subsubsection{Design}
The design was a 2 level (perverse or compatible incentive) between-subjects design.

\subsubsection{Materials}
The procedure and materials were the same as in Experiment Three except for the following modifications.  All mentions of `sharing' were changed to `communication'.  The data sharing task was described as follows:

\begin{quote}
``For each trial you communicate, another person will get only the triple you proposed and the feedback you received, nothing else.

The person will also receive the Final Answer you propose at the end of the task, regardless of the trials you communicate.  However, this person will only know about the trials you communicate, and does not know how many trials you completed or that you did not have to communicate all of your trials.''
\end{quote}

\begin{flushleft}
All data-sharing judgments were changed to the following wording:
\end{flushleft}

\begin{quote}
Do you think this trial should be communicated?
\end{quote}

Finally, participants were given the Actual Rule after they gave their final answer, and were allowed to change their data sharing judgments.

\begin{quote}
``In words, the actual rule is ascending consecutive even numbers from 2-100. \\
In math, the actual rule is:
\begin{itemize}
  \item There are three numbers, call them num1, num2, num3, in order. 
  \item $num2 - num1 = 2$
  \item $num3 - num2 = 2$
  \item $num1$ is even
  \item $num1 > 2$
  \item $num3 < 100$
\end{itemize}
If you want, you can now decide to change the trials that you communicate below.'' \footnote{The last two constraints should also include equality (i.e., $num1 \geq 2$ and $num3 \leq 100$).  This may have confused participants.  However, alternative analyses taking these differences into account did not change the results substantially.}
\end{quote}
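
Written as a set, and using symbols introduced here only for illustration, the rule described in words (with the bounds read inclusively, as the footnote indicates was intended) is
\[
R = \{(n,\, n+2,\, n+4) : n \text{ even},\ 2 \leq n,\ n+4 \leq 100\},
\]
so $n \in \{2, 4, \ldots, 96\}$ and $R$ contains 48 triples.  Read literally, with the strict inequalities shown to participants, the constraints become $n > 2$ and $n+4 < 100$, so $n \in \{4, 6, \ldots, 94\}$, which leaves 46 triples and excludes $(2,4,6)$ itself.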

\subsection{Results}
%\subsubsection{Incentives and Performance}
%As in Experiment Three, participants in the compatible and perverse incentive conditions completed a median of about 8 trials (9 and 7, respectively), $t(xx)=$ -0.12, $p=$ 0.91, $d=$ -0.012.%$  

\subsubsection{Attributions to Error}

There were main effects of both actual error ($t(1365)=$ 7.7, $p<$ 0.001) and feedback ($t(1365)=$ 6.1, $p<$ 0.001) on attributions to error, with an interaction between actual error and incentive, such that actual error had less of an influence on attributions to error in the perverse incentive condition than in the compatible incentive condition ($t(1365)=$ 2.7, $p=$ 0.011).  For attributions at the end of the task, those in the perverse incentive condition attributed affirming feedback to error less than those in the compatible incentive condition (3.8\% vs. 9.9\%), but the reverse was true for disconfirming feedback (43\% vs. 38\%) ($t(1369)=$ 3.7, $p<$ 0.001, $d=$ 0.099).

%Both during (38\% vs. 6.3\%) and at the end of the task (38\% vs. 9.9\%), those in the compatible incentive condition were more likely to see feedback as in error when it was disconfirming than when it was affirming, ($t(714)=$ 8.8, $p<$ 0.001, $d=$ 0.33; $t(724)=$ 7.6, $p<$ 0.001, $d=$ 0.28, respectively).  Similarly, both during (42\% vs. 5.5\%) and at the end of the task (43\% vs. 3.8\%), those in the perverse incentive condition were significantly more likely to see feedback as in error when it was disconfirming than when it was affirming ($t(599)=$ 9, $p<$ 0.001, \emph{d} = 0.37; $t(645)=$ 9.9, $p<$ 0.001, $d=$ 0.39, respectively).  

\subsubsection{Bayesian Consistency}

For the compatible condition, the overall correlation between their attributions to error and the consistency criterion was $\phi=$ 0.56, $\chi^{2}(1)=$ 246, $p<$ 0.001.  Participants in the perverse incentive condition exhibited similar consistency, $\phi=$ 0.56, $\chi^{2}(1)=$ 205, $p<$ 0.001.

%\footnote{There was also a main effect of perverse incentive, where they were slightly less consistent than those in the compatible group ($t(1263)=$ 2.3, $p=$ 0.031, \emph{d} = 0.064.  For affirming feedback in the compatible condition, they correctly attributed 12 of 19 trials to error, and incorrectly attributed 12 of 326 trials to error, $\chi^{2}(1)=$ 3.9, $p<$ 0.057, $\phi=$ 0.24.  For disconfirming feedback they correctly attributed 134 of 325 trials to error, and incorrectly attributed 4 of 42 trials to error, $\chi^{2}(1)=$ 10, $p<0.05$, $\phi=$ 0.32.  For affirming feedback in the perverse condition, they correctly attributed 3 of 9 trials to error, and incorrectly attributed 14 of 313 trials to error, $\chi^{2}(1)=$ 2.9, $p=$ 0.088, $\phi=$ 0.16.  For disconfirming feedback they attributed 115 of 249 trials to error correctly, and incorrectly attributed 1 of 22 trials to error, $\chi^{2}(1) = 16$, $p<0.05$, $\phi=$ 0.29.} 

\subsubsection{Accuracy}
Participants both in the compatible and perverse incentive conditions were able to accurately identify error during the task ($\chi^{2}(1) = 109$, $p<$ 0.001, $\phi = 0.39$; $\chi^{2}(1)=$ 111, $p<$ 0.001, $\phi=$ 0.43, respectively).

%\footnote{Participants in the compatible incentive group correctly identified 57 of 135 actual errors and incorrectly identified 107 of 651 non-errors as error.  For the perverse incentive condition, participants correctly identified 61 of 121 actual errors and incorrectly identified 75 of 540 non-errors as error.  This accuracy also slightly improved in judgments made at the end of the task for the compatible but not perverse incentive condition ($\chi^{2}(1) = 199$, $p<$ 0.001, $\phi = 0.47$; $\chi^{2}(1) = 96$, $p<$ 0.001, $\phi = 0.38$, respectively).}

\subsubsection{Data Sharing}

Participants in the compatible and perverse incentive conditions shared trials at similar rates (percentages below are for the compatible vs. perverse condition, respectively).  They shared fewer trials when the feedback was disconfirming (69\% vs. 73\%) than when it was affirming (93\% vs. 95\%), a main effect of feedback only ($t(1072)$ = 7.2, $p<$ 0.001, $\emph{d} = 0.22$), with no other main effects or interactions.  During the task they shared fewer trials that they attributed to error (37\% vs. 38\%) than trials they saw as accurate (94\% vs. 97\%), a main effect of attribution only ($t(1073)$ = 10, $p<$ 0.001, $\emph{d} = 0.31$), with no other main effects or interactions.  At the end of the task they shared fewer trials that they attributed to error (56\% vs. 55\%) than trials they saw as accurate (94\% vs. 97\%), a main effect of attribution only ($t(1478)$ = 11, $p<$ 0.001, $\emph{d} = 0.27$), with no other main effects or interactions.  Both conditions also exhibited the bimodal sharing pattern seen in Experiment Three, with some participants sharing all data attributed to error at the end of the task and others sharing none ($\chi^{2}(1)=$ 283, $p<$ 0.001), with no interaction with incentive condition.

%As seen in Figure~\ref{fig:was4fighist}, 

%of trials when they attributed feedback to error and 94\%, ($SE=$ 2.3\%) of trials when they judged it to be accurate during the task; and 

 %($SE=$ 10\%) of trials that they judged to be an error and 94\%, ($SE=$ 2.7\%) of trials when they judged it to be accurate at the end of the task ($t(588)$ = 5.6, $p<$ 0.001, $\emph{d} = 0.23$; $t(590)$ = 7.6, $p<$ 0.001, $\emph{d} = 0.31$; $t(853)$ = 4.4, $p<$ 0.001, $\emph{d} = 0.15$, respectively).  They also exhibited the bimodal sharing pattern as in Experiment Three, with some participants sharing all data attributed to error at the end of the task, and others sharing none $\chi^{2}(1)=$ 129, $p<$ 0.001.  

%($SE=$ 5.1\%)
%($SE=$ 2.5\%)
%($SE=$ 8.9\%)
 
%Participants in the perverse incentive condition shared at similar rates to those in the compatible incentive condition.  They shared (73\%, SE = 9.3\%)$ trials when the feedback was disconfirming and $(95\%, SE = 2.8\%)$ trials when it was affirming; 

%$(38\%, SE = 16\%)$ trials when they judged the feedback to be an error during the task and $(97\%, SE = 2.4\%)$ trials when they judged the feedback to be accurate during the task; and $(55\%, SE = 18\%)$ trials that they judged to be an error and $(97\%, SE = 2.6\%)$ trials that they judged to be accurate at the end of the task ($t(484)$ = 5.7, $p<$ 0.001, $\emph{d} = 0.26$; $t(483)$ = 5.5, \emph{p} $<$ 0.001, $\emph{d} = 0.25$; and $t(625)$ = 8.9, $p=$ 0.001, $d=$ 0.36, respectively).  Similar to those in the compatible incentive condition, they exhibited the bimodal sharing pattern as in Experiment Three, with some participants sharing all data attributed to error at the end of the task, and others sharing none $\chi^{2}(1)=$ 128, $p<$ 0.001.

%\footnote{Participants in the compatible incentive condition shared 197 of 312 trials when the feedback was disconfirming $(69\%, SE = 5.1\%)$ and 248 of 278 trials when it was affirming $(93\%, SE = 2.5\%)$, $t(588)$ = 5.6, $p<$ 0.001, $\emph{d} = 0.23$.  Similarly, they shared 56 of 132 trials when they attributed feedback to error $(37\%, SE = 8.9\%)$ and 389 of 460 when they judged it to be accurate $(94\%, SE = 2.3\%)$, $t(590)$ = 7.6, $p<$ 0.001, $\emph{d} = 0.31$.  At the end of the task, participants shared 97 of 206 trials that they judged to be an error $(56\%, SE = 10\%)$ and 585 of 649 when they judged it to be accurate $(94\%, SE = 2.7\%)$, $t(853)$ = 4.4, $p<$ 0.001, $\emph{d} = 0.15$.}

%$\chi^{2}(1)=$ 129, $p<$ 0.001.  
%As seen in Figure X.X, there was significant variation across participants in their decisions to share data after receiving disconfirming feedback, $\chi^{2}(1)=$ 17, $p=$ 0.001, whether they shared data that they perceived to be error during the task, $\chi^{2}(1)=$ 25, $p=$ 0.001, and whether they shared data that they perceived to be error at the end of the task, .  For all three judgments, most participants in the perverse incentive condition shared all of their trials, with a minority sharing less.  
%\footnote{Participants in the perverse incentive shared at similar rates to those in the compatible incentive condition.  They shared 119 of 234 trials when the feedback was disconfirming $(73\%, SE = 9.3\%)$ and 208 of 252 trials when it was affirming $(95\%, SE = 2.8\%)$, $t(xxx)$ = 5.7, $p<$ 0.001, $\emph{d} = 0.25$.  They also shared 26 of 114 trials when they judged the feedback to be an error during the task $(38\%, SE = 16\%)$ and 300 of 371 trials when they judged the feedback to be accurate $(97\%, SE = 2.4\%)$, $t(xxx)$ = 5.5, \emph{p} $<$ 0.001, $\emph{d} = 0.24$.  At the end of the task, they shared 79 of 157 trials that they judged to be an error $(55\%, SE = 18\%)$ and 424 of 470 trials that they judged to be accurate $(97\%, SE = 2.6\%)$, $t(xxx)$ = 8.9, $p=$ 0.001, $d=$ 0.4.}


Unlike in Experiment Three, both attribution to error and actual error determined data sharing, with no interaction with the incentive.  There was a significant main effect of attribution to error ($t(1217)=$ 6.9, $p<$ 0.001) and of actual error ($t(1217)=$ 4.5, $p<$ 0.001).  Overall, at the end of the task participants shared 61 of 154 (40\%) trials that were both actual errors and perceived as errors, 71 of 85 (84\%) trials that were actual errors but not perceived as errors, 101 of 184 (55\%) trials that were perceived as errors but not actual errors, and 825 of 905 (91\%) trials that were neither perceived as error nor actual error.


%\begin{figure*}[h] \pause
%    \centering
%\scalebox{0.9}{\includegraphics[\textwidth]{was4}}
%\caption{Proportion of trials shared by whether the trial was disconfirming (top row), whether participants attributed that trial to error during the task (middle row), and whether participants attributed the trial to error at the end of the task (bottom row).}
%\label{fig:was4fighist}
%\end{figure*}


\subsubsection{Data Sharing with Full Information}

After learning the Actual Rule, participants in the perverse incentive condition did not share trials that they knew to be errors at a higher rate than those in the compatible incentive condition, $t(1185)=$ 0.67, $p=$ 0.32.  Thus, they did not behave in an actively deceptive manner by communicating results that they knew to be false.  Instead, participants in the perverse incentive condition were more likely to share trials overall ($t(1187)=$ 1.9, $p=$ 0.071), but less likely to share trials that fit the Actual Rule, a marginally significant interaction ($t(1187)=$ 2, $p=$ 0.054).


%\begin{figure}[h] \pause
%    \centering
%\scalebox{1.2}{\includegraphics[\textwidth]{was4share}}
%\caption{Histogram of sharing consistent vs. inconsistent trials after knowing the rule.}
%\label{fig:was4consfig}
%\end{figure}

After being told the Actual Rule, those in the perverse incentive condition shared \emph{more} trials that fit the Actual Rule but were inconsistent with their Final Answer than those in the compatible condition, a significant three-way interaction, $t(1131)=$ 2.3, $p=$ 0.027.

%participants shared trials that were inconsistent with their final answer less $t(1131)=$ 3.8, $p=$ 0.001, and trials that fit the actual rule more $t(1131)=$ 5.9, $p=$ 0.001, both significant two-way interactions.  However, there was a two-way interaction, where participants shared trials that were inconsistent with their final answer but fit the Actual rule at a lower rate $t(1131)=$ 6.7, $p=$ 0.001.  There was another signficiant two-way interaction, such that participants in the perverse incentive condition shared trials that fit the Actual rule at a lower rate $t(1131)=$ 2.6, $p=$ 0.013.

%Our prediction was that some participants would be seduced by the perverse incentive, thus deciding only to share trials that were consistent with their final answer.  However, there was no difference between conditions in the probability of omitting data that were inconsistent with their final answer, $t(xxx)=$ 0.13, $p=$ 0.79.  A second way that participants could produce these results while exploiting the perverse incentive would be to seek out only affirming data, knowing they data would make a simple and convincing story.  One way to implement this weak testing strategy is to propose the (2,4,6) triple, knowing that they would receive affirming feedback unless the feedback is in error.  However, participants in the two incentive conditions were equally likely to propose (2,4,6) triples, $t(xxx)=$ 0.14, $p=$ 0.79.

\subsection{Discussion}

As in Experiment Three, participants attributed disconfirming feedback to error more than affirming feedback.  However, those in the perverse incentive condition attributed affirming feedback to error at a lower rate, and disconfirming feedback at a higher rate, than those in the compatible condition.  Although this interaction was small and likely due to chance, it may reflect the perverse incentive blinding participants to finding fault in affirming feedback while making them more sensitive to finding fault in disconfirming feedback.  Supporting this, those in the perverse incentive condition were less sensitive to actual error when making their attributions.  Although these two findings can be interpreted as evidence of motivated reasoning, neither effect was large enough to undermine either Bayesian consistency or accuracy.  

As in Experiment Three, participants in both conditions shared fewer disconfirming trials, fewer trials attributed to error during the task, and fewer trials attributed to error at the end of the task.  Likewise, participants in both treatment groups exhibited a bimodal data sharing pattern of trials attributed to error at the end of the task, either sharing all or none of these data.  These data sharing judgments depended on attributions to error even after controlling for actual error.

New to Experiment Four, when participants were given full information about the Actual Rule, those in the perverse incentive condition shared fewer trials that fit the Actual Rule than those in the compatible incentive condition, but did not share trials that they knew were errors at a higher rate.  Consistent with the results of Experiment Three, participants in the perverse incentive condition who knew a triple fit the Actual Rule but did not fit their Final Answer shared those trials at a higher rate than those in the compatible incentive condition.  Whether explained by reactivity, ethics, or something else, there was no evidence that participants took advantage of the perverse incentive to be deceptive by sharing less data.  This indicates that it was not uncertainty or partial knowledge that led to greater data sharing.  The perverse incentive did not lead to data sharing policies that a simple game-theoretic analysis would suggest.

In sum, participants given a perverse incentive either naively or explicitly considered the welfare of the person receiving the data, and decided to share data to help the other participant.  This stands in contrast to the perception of biased researchers who dispose of data that would harm their company's (or their own) profit.  Instead, those with and without an incentive to deceive another person shared trials that they attributed to error less than those that they saw as accurate, even after controlling for actual errors.  Instead of motivated reasoning, in both Experiments Three and Four those in the perverse incentive condition performed slightly, but not significantly, better in terms of both Bayesian consistency and accuracy.  Thus, we find strong evidence for a `cognitive' file-drawer problem, and weak or no evidence for a motivational one.  

%This may be the problem of focusing on the evaluation of another (XX mercier) that helps reasoning.  It may also be an accountability effect (Tetlock and Lerner XX).  
\section{General Discussion}

Over 50 years of psychological research has found that hypothesis testing follows a positive test strategy \cite{klayman1987confirmation}, whereby people collect data that they expect to affirm their expectations and discount disconfirming data, should it nonetheless reach them.  The present study asks how the positive test strategy affects data sharing.  We use the Wason 2-4-6 rule discovery task \cite{wason1960failure}, adding the possibility of error to simulate the uncertainty of actual research \cite{penner1996trust}.  In this task, participants seek to discover a rule by conducting `experiments' to test their hypotheses about its answer, then receive affirming or disconfirming feedback, known to have a 20\% error rate.  We extended the task by adding several incentive schemes, then examining their effects on participants' decisions about sharing the feedback they received with another person.  We also evaluated participants' performance in terms of the accuracy and consistency of their judgments of whether the feedback is error.

Experiment One replicated the pattern of results from previous studies, finding that disconfirming feedback is attributed to error more often than is affirming feedback \cite{penner1996trust}.  A new result is that participants' error attributions were generally consistent with their prior beliefs, in the sense of their being more likely to attribute affirmative feedback to error when they had strongly expected that the triple would not fit the rule, and being more likely to attribute disconfirming feedback to error when they had strongly expected the triple to fit the rule.  However, their judgments of whether the feedback was in error were unrelated to its accuracy.  Whether they shared trial results was unrelated to whether the feedback was disconfirming or attributed to error.

Experiment Two replicated Experiment One and added a new condition that provided participants with a large financial incentive for discovering the rule.  As in Experiment One, participants attributed disconfirming feedback to error at a greater rate than affirming feedback in the control condition, but not in the incentive condition.  Those in the control condition again made error attributions that were somewhat consistent with their expectations but were quite inaccurate.  In contrast, participants in the incentive condition were neither consistent nor accurate.  Experiment Two elicited data sharing decisions after each trial, using a fixed-response format, unlike Experiment One, which asked a single open-ended question at the end.  Participants in both conditions were more likely to share feedback if it was affirming and perceived to be accurate.

Experiment Three introduced two incentive schemes for sharing data: (a) \emph{compatible} incentives rewarded the sharer and receiver based on the receiver's success; (b) \emph{perverse} incentives rewarded the sharer based on whether the receiver believed that the problem had been solved, and did not disclose when data were not shared.  Both conditions penalized participants for making inaccurate probability and error judgments.  As before, participants in both conditions were more likely to attribute feedback to error when it was disconfirming.  The penalty increased both the accuracy and consistency of error attributions for participants in both conditions, compared to Experiments One and Two.  Contrary to prediction, participants with the perverse incentive shared more trials that were disconfirming or attributed to error than did participants with the compatible incentive.  In both conditions, despite these participants' ability to identify error feedback, their perception of error was more important than actual error in determining their data sharing.

Experiment Four replicated Experiment Three and used three additional controls.  Most importantly, uncertainty about the validity and usefulness of each trial was resolved by giving participants the Actual Rule at the end of the task, allowing them to change the data they shared but without letting them change their Final Answer.  Additionally, demand characteristics were controlled by changing the `data sharing' wording to `communication', and participants were explicitly told that the person receiving the data could not know that they chose not to share trials.  Experiment Four again found that participants in both incentive groups attributed disconfirming feedback to error and shared perceived errors at a lower rate.  Again, those in the perverse incentive condition did not exhibit motivated reasoning in terms of reduced Bayesian Consistency or accuracy, and shared more trials than those in the compatible incentive condition that were consistent with the Actual Rule but inconsistent with their Final Answer.

Four experiments provide support for the following interpretation of human error identification, communication, and motivated reasoning.  In terms of error identification, people naturally attribute disconfirming feedback to error, and this is above and beyond what is justified by actual error.  However, when penalized for failing to do so, they can identify error objectively, and do behave in a consistent (Bayesian) manner when making their error identifications.  Thus, people were biased, but not bad.  In terms of communication, participants consistently saw trials that they perceived to be errors as not worthy of sharing with another person, even after controlling for whether these trials were actual errors.  There was no evidence of motivated reasoning, either in terms of misattributing disconfirming results to error, or sharing unwanted results that could reduce one's financial profit.  In fact, participants frequently exhibited `reactive' \cite{dillard2005nature} or `ethical' behavior when confronted with a perverse incentive that may have been seen as encouraging deception.

The circumstances of the experiments differ from those of working scientists in several ways.  First, scientists never know the exact error rates in their experiments, but have, instead, just a range of plausible values based on their experience and intuition.  Those ambiguous error rates may be more readily modified to fit results than the fixed ones used in the experiments.  Second, although the patterns observed here generally parallel those observed in real labs \cite{dunbar1995scientists}, the participants were either undergraduates or MTurk respondents, not scientists.  The training and experience of working scientists may allow them to identify and report only accurate data, appropriately omitting errors that would confuse readers.

The results of four experiments suggest that financial penalties are needed to help participants accurately evaluate their data.  Without such penalties, Experiments One and Two elicited error attributions that were largely inaccurate and inconsistent with prior beliefs.  In Experiments Three and Four, adding a financial penalty for incorrect judgments substantially increased consistency and accuracy.  However, participants still shared data that were systematically biased by feedback, including inaccurate affirmations and excluding accurate disconfirmations.  This selective reporting occurred even when poor data sharing could cost the sharer money, as in the compatible incentive condition.

The difficulty participants had when trying to avoid sharing errors shows that helpful selective reporting is not easy.  Participants in all three experiments did not have the perfect accuracy in error attributions that would be required to selectively exclude errors from shared data.  One strategy they could have used to achieve accurate selective reporting is exact replication.  At the end of the task, exact replications would allow participants to clearly identify which trials were errors and which were accurate and, in turn, to selectively report only accurate data.  

Similar policies can help real scientists share data.  Experiment Three found that penalties for incorrect probability judgments and error attributions greatly increased consistency and accuracy.  One way to implement such a penalty would be to require that statistical analyses and experimental methods presented in published reports provide enough detail, in the paper or ancillary material, to be reproducible--with appropriate professional penalties for those who fail.  As a protection, researchers can adopt the protocols of impartial organizations dedicated to independent replication of experiments and analyses (e.g., \url{https://www.scienceexchange.com/}).  Another way of improving error identification is to encourage researchers to complete exact replications.  These replications allow researchers to identify errors with high accuracy and make selective reporting of perceived errors highly accurate.  

\bibliography{/home/alex/Dropbox/masterbib}
\end{document}

Failing to share data is usually indirect and the harm is not clear and not financial.  For example, not reporting a failure to solve a homicide XXX.  Failing to include non-adherents in a drug trial XXX.  When generalizing to everyday lives of lay participants XX not telling others that we were wrong is not affected by perverse incentives?  Incentives are usually in the form of accountability to others, rather than financial (Lerner and Tetlock).  An incentive to reach a particular conclusion, as in the case of the perverse incentive condition, can lead to interesting behavior.  The person may want to maintain an ``illusion of objectivity'' (Pyszczynski and Greenberg, 1987; Kruglanski, 1980) and not realize they are biased.  Or they may abandon their beliefs entirely and just try to win the game by deceiving the other person.  There is evidence that participants will generate different hypotheses, evaluate them differently, choose different rules of inference and implement them differently, choose and evaluate different evidence, and share different evidence with others (Pyszczynski and Greenberg, 1987).  is false, disappointing the researcher, and because they make a less convincing report, disappointing the audience.  Because of the pressure to generate a correct theory, and convince one's audience that it is real (genuine or not), a hypothesis can be elevated from one of many solutions to the only solution, preventing researchers from accurately evaluating and effectively sharing data that support or refute it.  Experimental evidence supports this, finding that incentives to solve an insight task can harm performance by promoting more vigorous application of bad strategies (Bonner, Hastie, Sprinkle, \& Young, 2000; Glucksberg, 1962; McGraw \& McCullers, 1979), and observational research shows that pressures to publish \cite{fanelli2010pressures}, and competition \cite{ioannidis2005early}, are associated with selective reporting.  
The combination of the desire to discover and the desire to convince creates an environment at high risk for distorted reasoning and selective reporting \cite{kunda1990case}.  

\section{Experiment Four}

Experiment Four provides a new participant with either all of the trials collected by a participant in Experiment Three, or only the trials this person decided to share.

\subsection{Method}

\subsubsection{Participants}  Carnegie Mellon University undergraduates completed the task for 5 dollars. There were X women and average age of 

\subsubsection{Design}

The between-subjects manipulation provided participants in Experiment Four with either the full set of trials the initial participant conducted, or only the trials the original participant decided to share at the end of the task.

In a matched pairs design, two participants in Experiment Four were matched with each participant in Experiment Three 100 participants from the original study.

Thus, the design was a 2 (full or shared data) by N (original participant's dataset) by 2 (incentive condition of original participant) between-subjects design.

Participants from Experiment Three were selected using a random sequence generator from random.org.  A sequence of random integers was generated, selecting the first 15 for inclusion.



\subsubsection{Materials}
Several participants were omitted because their final answers did not make sense.
All participants were told the following: 

\begin{flushleft}
The editor was then given the researcher's shared or full trials along with the researcher's final answer (see appendix 2 for the editor's instructions).  
\end{flushleft}

\begin{flushleft}
The editor was then asked on a seven point scale (1=very unlikely, 7=very likely):
\end{flushleft}

\begin{quote}
In your opinion, how likely is it that this person's Final Answer is correct?
\end{quote}

\begin{flushleft}
They were then given a ``betting bonus round'':
\end{flushleft}

\begin{quote}
Now you can earn bonus money by betting on the Final Answer of the person.

The person's Final Answer was given a score from zero to five based on how correct it was.  Higher scores mean that the person was closer to the Actual Rule.  A score of five means that the person got the Actual Rule exactly.

You have 5 dollars (500 pennies) to bet. You can bet each penny on whether you think the person's Final Answer received a score of zero, one, two, three, four, or five. Only one of the scores is correct.  You will get one cent (0.01 dollars) for each penny you assign to the correct score.  For example, suppose you bet 50 pennies on zero. If the person's Final Answer score is zero then you get 0.50 dollars. 
\end{quote}

\begin{flushleft}
Finally, they tried to guess the actual rule themselves:
\end{flushleft}

\begin{quote}
What do YOU think the correct rule is (it can be mathematical or in words, please be as specific as possible)?  You can receive up to 5 dollars depending on how close your guess is to the Actual Rule.
\end{quote}

\subsection{Results}
\subsubsection{Full or Shared Data}
controlling for total trials
guess score
previous participant accuracy

\subsubsection{Perverse vs. Compatible}
controlling for total trials
guess score
previous participant accuracy

\subsubsection{Disconfirmations Shared}
controlling for total trials
Proportion of disconfirmation shared as covariate
guess score
previous participant accuracy

\subsubsection{Error Attributions Shared}
controlling for total trials
Proportion of error attributions shared as covariate
guess score
previous participant accuracy

\subsubsection{Actual Errors Shared}
Proportion of actual errors shared as covariate
controlling for total trials
guess score
previous participant accuracy

\subsection{Discussion}

Might need to use a different initial triple and Actual rule (e.g., where the initial triple DNF the rule). 

experimental evidence suggests that those receiving data can infer whether they come from a knowledgeable or malicious source [You'll need to set up sharing-communication as a topic.  Currently, it appears first as the second clause in a complex sentence without an antecedent.] \cite{shaftolearning,shafto2008teaching,shaftoepistemic}, no evidence exists on the decision-making process of the person sharing the data. 

It is the greatest sin of all to overlook flaws in data that we like, as it is in nobody's interest to do so (that nature No shame article). 

However, it is rarely clear that these two assumptions are met in situations scientists usually face, where positive results are often the only ones that are reported (Fanelli, 2012), and researchers do not know the prevalence or magnitude of bias that results. Instead, selective reporting is likely to harm others, either by causing illusory patterns to emerge in the data that remain, or failing to stop mistakes from being repeated. 

Decisions about whether to share seemingly flawed data occur in a context where the livelihoods of researchers depend on producing convincing discoveries.  These pressures can improve error identification by encouraging deeper, broader, and more complex evaluation of the data \cite{kunda1990case,tetlock1987accountability}.  Pressures can also harm judgment by blinding researchers to alternative perspectives that, if taken, would have made faults in hypotheses apparent when appropriate \cite{arkes1991costs,bonner2000review,glucksberg1962influence,mcgraw1979evidence}.  Optimism about the likelihood of verifying a flashy hypothesis may make falsifying data more likely to be perceived as faulty \cite{krizan2007influence}.  Consistent with this, observational research suggests incentives, in the form of pressures to publish, are associated with suppression of falsifying data from publication \cite{fanelli2010pressures}, and failures to replicate basic research \cite{ioannidis2005early}.

Creating and communicating high quality research requires scientists to defend against hordes of cognitive and motivational monsters.  Collected data may fail to separate important theories \cite{doherty1979pseudodiagnosticity,fischhoff1983hypothesis}, or not establish reliable methods when error is possible \cite{gorman1992simulating}.  Informative data, once selected, do not prevent biased inferences, such as incorrect conclusions about causality, or unwarranted omission of alternative hypotheses \cite{fischhoff1983hypothesis}.  Accurate inferences on informative data do not preclude failures to share the right data with others.  Unexpected, null, and unwanted data \cite{kunda1990case} are probably rarely shared \cite{fanelli2010pressures}, as they don't help one's career \cite{collins1975seven}, seem meaningless, and are perceived to be symptoms of flawed methods rather than refutations of incorrect theories \cite{dunbar2001scientific,gorman1989error,gorman1986possibility}.   

Coherence is a widely accepted normative criterion for judgment.  The normative strength of coherence comes from two consequences of having coherent beliefs: 1) A person with coherent (Bayesian) beliefs cannot be given a series of bets that guarantees a loss of money (a Dutch book; \citeNP{danks2008explaining}); 2) coherence implies convergence of one's beliefs to the truth in the long run (of course, under some rather stringent assumptions: a finite dimensional parameter space; the truth is in the support of the prior; \citeNP{danks2008explaining,diaconis1986consistency,schulte1999logic}).

Not only does Bayesian coherence have normative appeal, but it has been successfully applied to many areas in engineering and cognition as a general framework for induction \cite<e.g.,>{griffiths2009theory,kemp2008discovery}.  For example, subjects in the Wason card selection task \cite{oaksford2007bayesian} can be seen as selecting a card so as to maximize the information gained from the selection with respect to subjective posterior distributions.  In fact, attribution of unexpected results to error would be predicted by a simple coherent Bayesian model, where one expects verification more than falsification ex ante and error is independent of the truth of a hypothesis.

Correspondence, or alternatively ecological rationality \cite{gigerenzer2004fast}, on the other hand, is the relationship between judgment and reality.  Judgment that is non-Bayesian but correct is not coherent but is ecologically rational.  Correspondence can be measured using signal detection theory, in terms of True Positives, True Negatives, False Positives, and False Negatives \cite{coombs1970mathematical}.  Along with this, an overall measure of correlation for contingency judgments, the phi coefficient, increases when either True Positives or True Negatives increase, and decreases when either False Positives or False Negatives increase \cite{cohen1983applied}.
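
For reference, the standard textbook definition of the phi coefficient for a $2 \times 2$ table of attributed versus actual error can be sketched as follows.  The symbols $TP$, $TN$, $FP$, $FN$, and $N$ are notation introduced here for illustration only; they abbreviate the four cell counts named above and their total, and are not taken from the reported analyses.
\begin{equation*}
\phi = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}, \qquad \chi^{2}(1) = N\phi^{2},
\end{equation*}
where $N = TP + TN + FP + FN$ and $\chi^{2}(1)$ is the Pearson statistic without continuity correction.  The numerator grows with the product of the two kinds of correct judgment and shrinks with the product of the two kinds of incorrect judgment, so $\phi = 0$ when attributions are unrelated to actual error and $\phi = 1$ when every attribution is correct.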

We are also interested in how data sharing policies affect the success of scientific knowledge creation.  Those receiving data can do no better in prediction than having all of the data \cite{shalizi2001computational}.  If both the sharer and receiver of data are altruistic and very perceptive, then it is possible to reduce noise and attention costs by sharing less than the complete set of data.  This is called \emph{pedagogical sampling} \cite{shafto2008teaching}.  However, bounded rationality and lack of common knowledge about each other's motives and beliefs may make pedagogical sampling difficult, if not impossible. 

Finally, scientists need to learn from others, in part by sharing data in the form of published papers.  A person receiving the data can effectively learn if she knows the data were not maliciously or haphazardly shared \cite{shaftolearning}.  It is rarely clear that these two assumptions are met.  Falsifying data may not seem informative to others, and as a result not worth sharing.  This applies to any scientific field on the ``frontier of knowledge'' where the meaning and usefulness of data are never clear \cite[p. 47]{de2006normal}.  In  the face of this uncertainty, scientists have to decide whether anomalous, noisy, or unexpected data should be `cleaned' and removed from publication, or whether this would constitute cooking the books.  Our hypothesis is that falsifying data are seen as due to error, uninformative, and not worth sharing with others.  We call this the \emph{differential diagnosticity} hypothesis, as affirmation is seen as diagnosing a true hypothesis more than falsifying data diagnoses a false one.

How is it possible that participants behaved in a coherent manner according to their subjective probabilities, but were also inaccurate?  Although their error attributions were consistent with their subjective beliefs ($\phi$ = 0.39) and their subjective beliefs corresponded to reality, \emph{r} = 0.39, 95\% CI [0.15, 0.64], \emph{t} (161) = 3.12, \emph{p} $<$ 0.05, the Bayesian model prescribes thresholds for judging error that are too extreme ($P(TFTR)>0.8$ for DNF and $P(TFTR)< 0.2$ for fit).  As a result, a subjective Bayesian given the beliefs of these participants would not predict actual error well, as indicated by a low association between an ideal Bayesian given the beliefs of the subjects in this experiment and actual error, $\chi^{2}(1)$ = 0.031, $\phi$ = 0.014.  Participants would, and did, identify actual error better by not acting as subjective Bayesians.
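
Thresholds of this form follow from a short calculation, sketched here under two assumptions that are ours rather than stated in the text: that $P(TFTR)$ denotes the subjective probability that the triple fits the rule, and that feedback is wrong with a fixed probability $\epsilon$ that does not depend on whether the triple fits.  A DNF is then more likely than not to be an error when
\begin{equation*}
\frac{P(TFTR)\,\epsilon}{P(TFTR)\,\epsilon + \bigl(1 - P(TFTR)\bigr)(1-\epsilon)} > \frac{1}{2}
\quad\Longleftrightarrow\quad P(TFTR) > 1 - \epsilon ,
\end{equation*}
and, by the symmetric argument, a FIT is more likely than not to be an error when $P(TFTR) < \epsilon$.  Taking $\epsilon = 0.2$ as an illustrative value, these conditions become $P(TFTR) > 0.8$ and $P(TFTR) < 0.2$.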

Actual errors were in the opposite direction.  Pooling all participants together, there was a higher actual error rate for affirming (20/70; 95\% CI [0.19, 0.40]) than falsifying (8/93; 95\% CI [0.04, 0.14]) feedback.  Participants attributed more error to falsifying feedback because, on average, they expected that their triples fit the rule more often than not (mean = 57\%, median = 50\%).  However, in reality, they only picked triples that fit the rule 35\% of the time.  Thus, if they were accurate in their perception of their rate of picking triples that fit the rule (35\%), they should have expected more errors in affirming rather than falsifying feedback, and not the reverse.
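
As an illustrative check of this last step, let $q = 0.35$ be the rate at which proposed triples actually fit the rule, and assume (hypothetically) that feedback is erroneous with probability $\epsilon = 0.2$, independent of whether the triple fits; the symbols $q$ and $\epsilon$ and the value of $\epsilon$ are assumptions made for this sketch.  The expected share of erroneous feedback is then
\begin{align*}
P(\text{error} \mid \text{FIT}) &= \frac{(1-q)\,\epsilon}{(1-q)\,\epsilon + q\,(1-\epsilon)} = \frac{0.13}{0.41} \approx 0.32, \\
P(\text{error} \mid \text{DNF}) &= \frac{q\,\epsilon}{q\,\epsilon + (1-q)(1-\epsilon)} = \frac{0.07}{0.59} \approx 0.12,
\end{align*}
so under these assumptions more error would be expected in affirming than in falsifying feedback, the same ordering as the observed rates.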

Experiment One did not support the differential diagnosticity hypothesis; participants were equally likely to share feedback judged as error and not error.  There are four problems with the method used in Experiment One.  First, the sample size was small enough that any failure to detect an effect may have been due to low statistical power.  Second, participants had a hard time differentiating sharing trials from sharing triples, and this confusion resulted in some participants not having usable data.  Third, participants completed the sharing judgments after a long (30 minute) period of hypothesis testing, and thus they may have been tired or inattentive by this time, a maturation threat to internal validity \cite{shadish2002experimental}.  Finally, when deciding to publish data, it may be more likely that researchers decide whether data are publishable after each experiment is conducted rather than after the entire series of experiments is finished.  This is an external validity concern. 

The present experiment reveals a case where participants are coherent but not accurate.  This pattern is the opposite of what would be expected from the Heuristics and Biases approach of \citeA{tversky1974judgment}, who propose both incoherence and inaccuracy, or the Fast and Frugal Heuristics approach \cite{gigerenzer2004fast} that is mute on incoherence but maintains accuracy.

The validity of these error attributions has important implications for how science is taught and practiced.  Valid judgments indicate that researchers, by training or intuition, are equipped to spot and remove error from data.  Invalid judgments tell us that education and debiasing may be needed to help guide intuitions about experimental error, and that actions based on error attributions should be cautious.  

There are at least two ways error attributions made by researchers can be valid.  If data that refute a hypothesis are actually faulty, and consistent data actually flawless, then it is justified to attribute error to them accordingly.  Attributions of error are also valid if they are consistent with prior beliefs, according to Bayes' Rule.  This latter view has been supported recently by Bayesian psychologists (\citeNP<e.g.,>{oaksford2007bayesian,griffiths2009theory}) who argue that hypothesis testing and causal judgments are descriptively Bayesian, for a variety of tasks, including Wason's card selection task and Wason's 2-4-6 rule discovery task.  

Whether the data are consistent with one's hypothesis is not the only determinant of data sharing.  Data inconsistent with a theory are unwanted, both because they suggest the theory




\begin{table}[h]
\begin{tabular}{c c c c c}
Measure & Condition & Share All (prop.) & Share None (prop.) & Mean Share (prop.) \\ \hline
Falsification & Compatible & 0.33 & 0.12 & 0.70 \\
Error & Compatible & 0.24 & 0.22 & 0.45 \\
Final Error & Compatible & 0.37 & 0.27 & 0.62 \\
Falsification & Perverse & 0.47 & 0.16 & 0.78 \\
Error & Perverse & 0.35 & 0.20 & 0.67 \\
Final Error & Perverse & 0.33 & 0.20 & 0.57 \\ \hline
\end{tabular}
\end{table}

\subsection{Individual-Level Sharing Policies}

Scammer: omission of trials that are not actual errors and that are inconsistent with the Final Answer.

The positive test strategy could be used to avoid collecting evidence that one doesn't want to have to share.  One check could be to compare the mean number of H+ trials proposed.

Many participants had no opportunity or reason to scam, as they proposed H+ trials, got affirmation, then quit.  This may be one strategy: collect just enough data to be convincing and quit before anything goes wrong.

Could it be that people who start with a bad hypothesis (bad intuition; participant (54)) end up getting confused, screwed, and seduced?  Seems like people are particularly vulnerable at the beginning, especially if they propose an unlucky triple, or get error right away.

Maybe they thought that the more information they share, the more confident the participant would be in their final answer, regardless of whether the trials shared matched their final answer.

Because they don't know their final answer during the task, those in the perverse incentive condition may have just decided to share all the trials because of ambiguity.  At the end of the task, some of them realized they could scam, and did.

Motivated reasoning among scammers?  Were their final error attributions among scammers bad?

An alternative strategy to not sharing negative data is to not look for negative data.  Do perverse and compatible conditions differ in the number of H+ and H- tests?  For example, in the perverse condition just collect a few trials that are expected to be affirming and share them?

To understand the individual-level variation better, I looked at individual-level data sharing policies, coding each participant as scammer or not scammer.  It seems like at the end of the task participants found it easiest to just share all trials??  A participant was counted as a scammer if they omitted trials that were inconsistent with their final answer and that were not errors, ethical if they included all inconsistent trials, and neither if they never collected a trial that was inconsistent with their final answer.

A confluence of factors would lead them to share all trials at the end of the task: 1) laziness, 2) frustration, 3) seeing them all as valuable, 4) taking an ethical stance.  They did not seem, at the end of the task, to expect cleaning the data to be helpful. 

During the task, however, these factors are not at play; only the incentive is different.

\subsubsection{Participant 1}
Participant one completed three trials.  This person's final answer was ``take the first number, double that, then add two''.  This does match the person's behavior in the task, where they attributed error to a false feedback on (30,32,34), then seemed to change their hypothesis to the final answer, proposed that triple, and stuck with it.  This person only made one judgment at the end of the task, which was to share the third trial that fit their final answer.  This person was in the perverse incentive condition, and thus seemed to hide negative data by not making judgments (NA) on trials that she did not want to share.  Might it be that people who scam me (by completing few trials) also scammed the other person?

\begin{table}[h]
  \begin{tabular}{c c c c c c}
    Num1 & Num2 & Num3 & Feedback & Share & Error \\ \hline
    22 & 24 & 26 & FIT & Share & Accurate \\
    30 & 32 & 34 & DNF & No & Error \\
    4 & 8 & 10 & FIT & Share & Accurate \\ \hline
\end{tabular}
\end{table}

\subsubsection{Participant 2}
This person completed sixteen trials.  This person seemed to start out with the usual ascending consecutive evens (12,14,16).  Tested H- triples of (5,7,9) and (7,9,11), both gave the expected answer (DNF), and shared them as they were not seen as error. Went to (102,104,106) and got DNF, then attributed it to error.  Continued to share triples that confirmed the ascending consecutive evens, attributing error to stuff inconsistent with this.  Came on the boundary again (98,100,102) but seemed to respect it the second time, including it in the final answer, scoring a 4.  The person shared all trials at the end, rather than sticking with the original sharing judgments (laziness?).  Made only one final error attribution to (34,36,38), which was an error.  This person was in the perverse condition. 

\subsubsection{Participant 3}
This person started with an uncommon initial rule of multiply by two then add two.  First proposing (8,10,12), getting affirmation, then attributing it to error, next proposing (14,30,62), getting falsification and attributing it to error. This person could not collect one trial that supported her hypothesis, but retained it throughout.  At the end of the task, this person shared no trials, attributing all of them to error.  Why would this person not share any trials?  Frustration, laziness?

\subsubsection{Participant 4}
This person seemed to start with ascending, consecutive evens as the initial hypothesis, but proposed an H- trial initially (25,27,29), getting falsification, not sharing it, and attributing it to error (very weird).  The person then proposed a series of H+ trials, omitting one from sharing that was an error (56,58,60).  The next H- test (57,59,61) was not seen as an error this time when it came up disconfirming, but (33,35,37) was.  At the end of the task, the person just shared all of the data, and made correct error attributions (56,58,60), but not (56,58,62).  This person also missed error on (98,100,102) and (1002,1004,1006), both affirming.  The final answer was very good, ascending consecutive evens starting at 2.  Compatible condition.

\subsubsection{Participant 5;246810sm}
This participant collected a lot of trials, with many different hypotheses/tests.  The person started out with (1,7,9) a very strange one, got positive feedback, and shared it.  Actually, this person seemed to decide from the very beginning that she was going to share all of the trials.  This person did not complete the final attributions and sharing. 

\subsubsection{Participant 6}
This person started with some version of ascending consecutive evens, proposing (10,12,14) first, receiving disconfirmation, attributing it to error and not sharing. This person continued several more trials with affirming feedback.  I don't think this person was a scammer because the (8,10,12) that was omitted at the end was an obvious error from this person's perspective.  Multiples of two in sequential order was the final answer.

\subsubsection{Participant 7}
This person seemed to have the standard initial hypothesis.  Omitted one trial from sharing that was an error (12,14,16; DNF) and one that was also an error (3,5,7,FIT) but failed to attribute the latter to error, but then subsequently did with (7,9,11, DNF) twice, attributing both to error.  Seemed to misunderstand the error attribution?  Corrected the error attributions at the end of the task, but omitted disconfirming trials that supported the initial hypothesis (7,9,11; DNF).  Final answer was add two to each number starting with an even.

\subsubsection{Participant 8}
This person proposed (4,8,12) initially, obviously an unusual hypothesis.  This trial was falsifying, attributed to error, and not shared.  The rest of the trials were (2,4,6), and all shared except one falsifying/error trial.  At the end, this person omitted a (2,4,6) error and a (4,8,12; DNF). The final answer seemed to be ascending consecutive evens starting from two.  The final answer was hard to follow.

\subsubsection{Participant 9}
Started with a usual hypothesis and triple (8,10,12).  Moved to the boundary (100,102,104; DNF) and saw it as error, but still shared it.  Got another actual error (30,32,34; DNF), now seeing it not as error.  Gets back on track with more ascending consecutive evens, then hits a bump (72,74,76;DNF), correctly identifying it as error.  Hits the boundary again (102,104,106; DNF) initially not sharing it, but then sharing it in a replication.  At the end, the person just shared everything, even two correct attributions to error.  Sequential even numbers less than 100 was proposed, a very good answer.  So why share all trials at the end of the task?

\subsubsection{Participant 10}
This person started out with a usual hypothesis (8,10,12;DNF) attributing it to error and not sharing. Proceeded normally with a few H- trials (1,3,5; DNF) not attributed to error, as expected.  Tried a downward sequence (10,6,8; FIT) and correctly attributed it to error, then got the correct feedback (10,6,8;DNF) and didn't attribute it to error.  Hit the upper boundary first (100,102,104;DNF) attributing it to error, then replicating (100,102,104;DNF) and changing the theory.  Downward triples to be allowable (88,86,84; FIT) attributing it to error, but then again replicating (86,84,82;DNF) not attributing it to error.  During the task, the person shared all the trials except the first one.  I have no explanation for why.  At the end of the task this person omitted almost every trial attributed to error and was accurate on almost all of them.  Final answer was single or double digit consecutive ascending evens, almost an exact score.

\subsubsection{Participant 11}
This person conducted five trials, all H+ and affirming, no errors, and shared them all.  Same at the end of the task. Proposed ascending consecutive evens.

\subsubsection{Participant 12}
This person did one trial (8,16,24, DNF) shared it then quit.  No final judgments, no final answer.

\subsubsection{Participant 13}
This person completed a lot of trials.  Started out in the usual way with (6,8,10; DNF) and sharing it.  Hit two errors in a row (10,12,14;DNF) and (4,6,8;DNF), correctly attributing both to error and sharing neither.  Kept H+ trials until three errors in a row with (50,52,54;DNF) correctly attributing all to error and sharing none.  I don't understand the final sharing policy at all.  Many affirming trials with no error were omitted, and many trials correctly attributed to error were included.

\subsubsection{Participant 14}
Five trials, all H+ for usual hypothesis, with one error on the fourth trial (8,10,12;DNF) correctly identified.

\subsubsection{Participant 15}
Conducted many trials.  Started off with some weird hypotheses (6,12,18; DNF), not shared but not seen as error.  Then another weird one (155,157,159;DNF), this time shared and attributed to error; very weird.  Several more weird ones (10,12,16;DNF) not seen as error and not shared, then (220,222,224;DNF) seen as error and not shared.  Then a more normal one (68,70,72;DNF) not seen as error but shared. Then on to the hypothesis of 2x then 2x+2 (8,16,18;FIT) and shared.  Then tried it again (100,200,202;DNF) now seeing this as error.  At this point the participant is very confused, and gets two more errors in a row that support her hypothesis (26,72,74;FIT) and (40,80,82;FIT) both shared, making the person think that x,2x,2x+2 is correct.  Trying again, though, suddenly it stops working  (17,34,36;DNF) and (36,72,74;DNF) not shared and attributed to error.  Then the person seems to give up on the x,2x,2x+2 after two more failures (35,70,72;DNF) and (4,8,16;DNF) both seen as valid and both shared.  This person moves to $x,x^{2},x^{2}+2$ getting more disconfirmation (4,16,18;DNF), then an error (4,16,6;FIT) which was shared.  Another failure of the new hypothesis (9,81,83;DNF) attributed to error, then another unfortunate error for a new hypothesis $x,x^{2},x+2$ which no longer fits the initial triple.  These failures continue until the end.  The person finally goes with $x, 2x, 2x+2$ as a Final Answer.  The final data sharing policy seems to perfectly omit trials that do not fit this hypothesis.  A clear scamming attempt.

\subsubsection{Participant 16; easterbunny}
Completed a lot of trials.  Except for the first trial, shared all trials during the task.  No idea why.  Final answer was $x,x^{2}+2,x^{2}+2$ I think.  Clearly scammed at the end.

\subsubsection{Participant 17;queen24}
Collected only five trials, seemed to start with a weird hypothesis (7,15,18;FIT), and seemed to go off track right away.  Attributed all trials to error and shared them all.  This person got 4 errors on five trials!  Shared everything again at the end of the task.  Nonsense final answer.  Don't think this person had any idea what was going on.

\subsubsection{Participant 18;DJJ4230MTURK}
Again, from the very beginning shares all trials.  Starts with a usual hypothesis (10,12,14;DNF) and correctly attributes it to error. Sticks with the ascending evens, making one weird error attribution (8,10,12;FIT; error), two more correct ones (20,22,24;DNF) twice.  Shared everything again at the end of the task. Made two strange error attributions at the end of the task (7,9,11;DNF) and (15,17,19;DNF), especially with a final answer of ascending consecutive evens.

\subsubsection{Participant 19}
Collected only three trials (1,3,5;DNF), (6,12,14;DNF), (3,5,7; FIT).  Called the first two errors and didn't share, and shared the last one.  Didn't complete the end of the task and proposed a final answer that was consistent with the only trial shared $x,x+3-1,x+3+3-1-1$.

\subsubsection{Participant 20}
Started with a weird hypothesis (3,6,9;FIT) and shared it.  Then tried to replicate twice (3,6,9; DNF) and failed both times, sharing it the second time.  Then proposed a more usual triple (8,10,12;FIT) and shared it, then tried (3,6,9;DNF), shared it, but also attributed it to error.  This person's final answer was evens, and this mostly matched the data shared in the end.

\subsubsection{Participant 21;291284}
Shared all trials from the very start.  Started with (10,12,14;FIT) and shared it, then went to (5,7,9;DNF) and shared it, but also attributed it to error (must have misunderstood).  Then came to the lower bound but got an error (0,2,4,FIT) and shared it.  Went to (1000,1002,1004;DNF) and shared it but also attributed it to error.  At the end this person just shared all the data again.

\subsubsection{Participant 22}
Started with a weird hypothesis (3,5,7;DNF) shared it and attributed it to error.  Then went to (10,12,14;FIT) and shared it.  Went back to (1,3,5;DNF) and saw it as error and didn't share. Then (3,5,7;DNF) saw it as error but shared it.  Strange.  Just shared everything at the end of the task and made no error attributions, I guess not trying by then.  Every number increases by two was the final answer.

\subsubsection{Participant 23}
This person basically did 5 H+ trials, got affirmation on all and shared and stopped. Got no errors.  No opportunity to scam.

\subsubsection{Participant 24}
Very similar to 23.  5 H+ trials, all confirming, shared all, no errors.  Ascending evens.  No opportunity for scamming.

\subsubsection{Participant 25}
Person collected only 3 trials, starting with  H+ trials, sharing them, then getting (20,40,60;DNF) sharing it and ending.  Didn't make final attributions.  Proposed increasing evens by two.

\subsubsection{Participant 26}
Started with H+ (10,12,14;FIT) shared it.  Continuing then omitting one H+ (8,10,12;FIT) possibly because of redundancy.  Then proposed (14,24,38;DNF) and shared it, then (4,6,12;DNF) and shared it.  Then got two errors (10,12,14;DNF) and (8,10,12;DNF) and didn't share them.  At the end the person omitted two errors that were both actual errors.

\subsubsection{Participant 27}
Started with an H+ (8,10,12;FIT) shared it but also attributed it to error. Got some H- (28,50,60;DNF) and (16,30,40;DNF), but didn't share the latter.  This person didn't share falsifying trials, incorrectly attributing them to error, but they weren't inconsistent with the final answer.

\subsubsection{Participant 28}
This person shared all trials right from the start getting H+ trials but with three errors.  The person correctly identified all three errors.  Again shared all trials at the end of the task.

\subsubsection{Participant 29;2003492395935}
Shared all trials right from the start.  Began with H+ (8,10,12;FIT).  Got some falsifying trials and incorrectly attributed them to error (must have misunderstood).  Again shared everything at the end. 

\subsubsection{Participant 30;29881}
Matt Goetz: A3VY8B6YM3T5D6

Began with weird triple (12,14,18;DNF) shared it then followed up with (2,8,10;DNF).  Shared all trials. Still shared all trials at end of task, and made weird error attributions.  Went with even numbers below 20, which explains the error attributions.

\begin{quote}
``My thinking was to establish a narrow band of options first which continued my assumed pattern forward, and then progressively widening the gaps of that band. I'd occasionally repeat the narrow band to refresh my certainty (such as it was) with my assumption. The larger gaps and skewed numbers were a way of testing that theory at the outside edge.''
  
  ``After the initial stage, I thought I would share information I'd presented in the hopes of furthering the success of anyone who attempted to discover the rule after me.  After all, my sharing can help others and has no negative impact on me. Why wouldn't I, at that point, be happy to share my results?''
\end{quote}

  

\subsubsection{Participant 31}
Just proposed the same triple (8,10,12;FIT) five times and shared all.  Shared all at the end.

\subsubsection{Participant 32}
Collected some H+ trials, but omitted some at the end that seemed to be redundant.  Also omitted one actual error.  At the end, shared everything except the one error.

\subsubsection{Participant 33}
Started with an odd triple (4,16,20;DNF) shared it.  Followed up with another weird one (4,16,36;DNF) again shared it. Only didn't share one trial (10,20,30;DNF) which was incorrectly attributed to error. Shared everything at the end of the task, but also strangely attributed them all to error.  Possibly just clicked through.

\subsubsection{Participant 34;math1}
Mark Thoma: A3UTT0I3W37RSE 

Shared everything right from the start.  Made some correct error attributions and stuck with ascending consecutive evens with a few H- trials.  Again shared everything at the end.

\begin{quote}
``Well, I tried to find the mathematically simplest rule that would work.  I think I started with the idea that the addition of 2 to the previous number was the rule, provided that the previous number was 0 or even.  Most of the time, that seemed to wok, though there were some times that it didn't.  But about 20\% of the tome the computer rating of FIT/DNF could be wrong.  So I reasoned (but didn't calculate) that my original assumption could be the rule/  I think I did try at least two other possibilities.  One was that the initial numbers given could be the square roots of other numbers.  But that didn't seem to work.  I think I also tried using the ``add two'' rule to odd numbers, but that didn't work.

  I also thought about things like: using a base other than base 10, using logs, etc. but didn't investigate those possibilities.

  I shared everything because, not knowing if the FIT/DNF response by the computer was correct, I didn't want to deliberately bias the info I passed on by being selective.''
\end{quote}


\subsubsection{Participant 35}
All H+ triples, with a few errors correctly attributed to error and not shared.  Same strategy at end of task, omitting actual errors.

\subsubsection{Participant 36}
Three H+ trials (8,10,12) but the last one was DNF which was not shared.  Then did (6,6,6;DNF) and not shared, and (6,3,0.75;FIT) and shared.  Bizarre final answer that is descending.  Seemed to try to scam but failed.

\subsubsection{Participant 37}
Tried different arrangements of the numbers 2,4,6.  Shared all trials except a replication of (4,6,2;DNF).  Shared everything at the end.  No final answer.  Didn't seem to understand the task.

\subsubsection{Participant 38;9841510784}
Shared everything.  Started with a weird triple (2,2,2;DNF), then went to normal H+ ones with one correct error attribution after (12,14,16;DNF).  Shared everything at the end again.

\subsubsection{Participant 39;3804MT}
Pam Wong: A2WP0DRQ2QU7MH

Shared everything from the start.  Started with (2,4,6;DNF) and attributed it to error.  Then went to (6,4,2;DNF) attributing it to error, then (1,2,3;DNF) but not attributing it to error.  Then finished with three replications of (2,4,6;FIT).  Again shared everything at the end, with only 1 error attribution that was correct.


\subsubsection{Participant 40}
Shared everything from the start with all H+ triples.  Only one false feedback at the end that was not attributed to error (6,8,10;DNF). No final judgments.

\subsubsection{Participant 41}
Shared everything from the start.  All H+ triples with two errors that weren't identified.  Shared all at the end of the task and correctly identified both the errors.

\subsubsection{Participant 42;85143}
Started with (8,10,12;FIT) was shared.  Then (4,6,10;DNF) shared, then (4,6,10;FIT) shared.  Seemed to be moving around in hypotheses (8,10,18;DNF) seen as error, then (10,20,30;FIT) shared.  (15,15,30;DNF) not shared and seen as error.  Kept proposing triples that didn't fit and incorrectly attributing them to error.  Then finally gets to (22,24,26;FIT) and seems to settle on ascending evens.  At the end of the task just shared everything, still making incorrect error attributions.

\subsubsection{Participant 43;141950subramanian}
Sidhu Ram

Shared all trials with two correct error attributions on (8,10,12;DNF) and (10,12,14;DNF).  Shared everything at end of task again.

\subsubsection{Participant 44;670557}
Shared everything from the beginning.  Began with H+ (20,22,24;FIT) then hit the boundary (108,110,112;DNF) thought it was an error but still shared it.  Got a real error (16,18,20;DNF) but identified it.  Shared all at the end of the task.

\subsubsection{Participant 45;A2IDLKSKCZTIGS}
Usual H+ (8,10,12;FIT) shared, then hit the boundary but thought it was an error (100,102,104;DNF) but still shared it, then (2,4,6;FIT) but didn't share it, then two more shares.  Shared everything at the end.

\subsubsection{Participant 46;mturk123mturk}
Started with H+ (8,10,12;FIT), then got (3,5,7;DNF) and didn't think it was error.  Then went (-2,0,2;FIT) and shared it, not seeing error, then (-8,-6,-4;DNF) thinking it was error and not sharing. Then (-4,-2,0;DNF) but now shared although seen as error.  Hit the boundaries twice (-2,0,2;DNF) and (102,104,108;DNF) and shared both.  Got a few errors that weren't shared.  No end of task judgments.  Almost got the rule with $n, n+2, n+4$ where $2 \le n \le 92$.

\subsubsection{Participant 47;2468}
Shared everything from the start, with one incorrect error attribution (5,10,15;DNF).  Shared everything at the end with all five correct error attributions. 

\subsubsection{Participant 48;rothris119}
Shared all trials but got the first (6,8,10;DNF) correctly attributed to error and shared.

\subsubsection{Participant 49;44290}
Participant went with usual trials (4,6,8;FIT) with a couple disconfirming (3,6,9;DNF) initially perceived to be error then not.  Didn't share an error (2,4,6;DNF) and (4,6,8;DNF).  Same data sharing pattern at the end.

\subsubsection{Participant 50;20015a}
Usual pattern (14,16,18;FIT) with one no share (16,18,20;FIT) not sure why.  Same sharing pattern at the end of the task.

\subsubsection{Participant 51;thousandarms97}
Usual pattern with one error (22,24,26;DNF) correctly identified and not shared. Same sharing at the end.

\subsubsection{Participant 52;7346332}
Shared all.  No errors, no error attributions.  Same at the end.

\subsubsection{Participant 53;A1Q0QCHCD2YOIR}
Michael Burke: A1Q0QCHCD2YOIR
Seemed to be a sneaky person.  Only conducted five trials all the same (8,10,12;FIT).

\begin{quote}
``It's been a while. From what I remember:

I was given the same three numbers in every instance (2, 4, 6). I think the rule I input was "even numbers, multiples of 2" or something similar. I knew I could get cute, and change the rule (i.e., 2 times 1, times 2, times 3, times 4, etc.), but I felt it was best to make sure that the rule was easy to understand, and in all honesty, easy to type. Thus, even numbers, increasing by 2.

The instructions had stated that I'd get twenty tries to figure out the rule, and that I'd also get three practice tries. And that the practice round "answers" were the same as the actual trials. I honestly expected a more difficult set of numbers. Since I was told at the onset that the practice was the same as the actual, and right off the bat I got the practice question right, then it was just a matter of doing the trial a couple more times to make sure that I didn't get a false computer reading.

Getting "FIT" six times from the "computer" was sufficient enough (in my opinion) that I wasn't getting false answers from the computer. Even though I didn't answer "100%" on the slider reflecting whether I thought I was right, by turn five I was fairly certain that I was correct in determining the rule. I knew that there was a chance of there being a trap, and that if I continued doing all the trials one or two later trials could show up as false from the computer, but as mentioned I was feeling fairly confident that I figured out the rule.

I can't remember the requirements for the Share function, assume it's whether or not to share my three numbers (8, 10, 12) with another person so they could figure out the rule themselves? In any event I'm usually prone to sharing, so by default would probably have selected share without thinking too much about it.''
\end{quote}


\subsubsection{Participant 54;ea082329}
Conducted a lot of trials.  Started with a bad hypothesis (3,6,9;DNF) and shared it, then continued with H- triples.  Ended up completely unable to get affirmation except for errors, which were difficult to identify.  Ended up sharing most trials.  Ended up just sharing everything at the end.

\subsubsection{Participant 55;szcluvrk}
Shelly Zhao: A2WP0DRQ2QU7MH

Shared everything.  Started with (8,10,12;FIT) but thought it was error, then (3,5,7;DNF) shared, then perceived one more error at the boundary (1000,10002,10004;DNF).  Also shared everything at the end.

\begin{quote}
``Sure. I believe the numbers I got were 2, 4, 6, 8 or the similar pattern. I started by wondering if it was even numbers in sequence (Trial 1 and 2). I tried odd numbers to see if it was limited to even numbers or just a difference of 2, which also seemed to work. I tried large numbers to make sure it was a +2 pattern, or if it might be a x2 for the second number and +2 for future numbers. (since 2+2=4 and 2x2=4 as well)''
``I believe I decided to share all of them because I didn't have a preference to share/withhold information. I thought if I could assist someone else to find the pattern, that would be great!'' 
``I mean I did not have a strong preference to withhold information, so I decided to go ahead and share the results I had. I did not feel particularly strongly about sharing or withholding the information, so thought if I could help others with my results, I may as well share them.''

\end{quote}

\subsubsection{Participant 56;3796js}
Joel Sprague
Usual beginning (6,8,10;FIT;Share).  Eventually got (27,29,31;DNF) no share and (41,43,45;FIT) no share, error.  Shared everything at the end.

\begin{quote}
``I seem to remember something about the task saying that the more trials you use, the worse your score?  It was either that, or I was trying to minimize the amount of trials just to keep from confusing who I shared it with.  I honestly don't remember which it was in particular, but it was one of those.  Honestly, there was a little bit also of just wanting to guess the rule with as few trials as possible, just out of pride.

As to the sharing, that was relatively straightforward since I only got the one false positive, as far as I could tell.  I made sure not to share the false negative, as I didn't want to confuse whomever I shared trials with.  Everything that looked right, I shared, and marked as correct, just so I could give the other person the best information possible.  Of course, that was dependent upon them trusting that I knew the task and was giving them the best results.''
\end{quote}

\subsubsection{Participant 57;A1E8QOIXZ0AS2Y}
Started normally (8,10,12;FIT) but then got (4,6,8;DNF) correctly identified as error, then (10002,10004,10004;DNF) seen as error and not shared.  Very strange.  Then a few other strange trials.  Proposed that the triples are increasing by two.  Definitely scammed.

\subsubsection{Participant 58;Max2012}
Usual start with one falsifier at the end not shared (4,8,12;DNF).  Didn't finish the task.

\subsubsection{Participant 59;xxx}
Preview response.

\subsubsection{Participant 60;113240}
Abdul Aseez; A2SNIL2ZLU5A3W

All H+ trials with one error right away (6,8,10;DNF).  All shared.  May have been a sneaky stopper.  Shared everything at the end.

\subsubsection{Participant 61;87128VEPS}
Shared all except one trial (8,10,12;FIT).  Shared all again at the end except for one (7,9,11;FIT) and correctly attributed to error.

\subsubsection{Participant 62;33396}
All falsifications, all shared, none attributed to error. No H+ trials were proposed, except for the first trial (12,14,16;FIT).

\subsubsection{Participant 63;730113}
Neji Earth: A2SNIL2ZLU5A3W

Shared all.  Started weird (6,2,8;DNF) shared it. Then got (3,6,9;FIT).  Then (3,6,9;DNF) and thought it was error.  Then (8,10,12;FIT) and thought it was error.  Shared everything at the end and incorrectly identified one error.

\subsubsection{Participant 64;A3IQ4T5TA613X}
Started with (12,14,16;FIT), moved to (6,12,18;DNF).  Came on the boundary (102,104,106;DNF) and thought it was an error and didn't share. Had quite a strange sharing pattern.  Very strange hypothesis $a<100, 10a+2,10a+4,10a+6$.  Omitted several actual errors.

\subsubsection{Participant 65;sunitha01}
Possibly sneaky (4,6,8;FIT).  Shared all.  No final judgments.

\subsubsection{Participant 66;97200319}
Shared all from the beginning.  Mostly H+ trials with a few errors, most of them missed though.  Shared all at the end.

\subsubsection{Participant 67;missklp17}
All H+ trials, shared all.  One incorrect error attribution at the beginning (10,12,14;FIT).  No error attributions at the end.

\subsubsection{Participant 68;cbartok20}
Started with trouble (4,8,12;DNF) and (1,4,10;DNF) not shared.  Got back on track with (4,6,8;FIT).  Hit the boundary with (102,104,106;DNF) shared but (102,106,110;DNF) not shared.  A couple more trials not shared.  This person scammed by omitting (102,104,106;DNF) at the end, which didn't fit their hypothesis (ascending by 2) and was incorrectly attributed to error; likewise (5,7,9;DNF) was omitted and attributed to error.

\subsubsection{Participant 69;A2NY4882G4701C}
 	
Shared all.  All confirming except two crossed the boundary (102,104,106; DNF) attributed to error and (202,204,206;FIT) shared.  Same pattern at the end with the same incorrect attribution.

\subsubsection{Participant 70;051188}
Shared all trials.  A few H- trials with DNF feedback, but no errors and none identified.  All shared at the end.

\subsubsection{Participant 71;kim65}
Shared all except one correct error attribution (88,90,92;DNF) and missed one error (102,104,106;FIT).  Same at the end. 

\subsubsection{Participant 72;daisy14}
Tiffany Kilby: A351P2O6QGTCS

This person generally omitted trials attributed to error, which were frequently right.  A few mistakes were (1,3,5;DNF) attributed to error, but then it seemed like the person realized odds don't fit the rule.  Final answer was even numbers.  This person omitted disconfirming trials that fit the rule (1,3,5;DNF).  Doesn't seem like scamming, but not sure why they would omit.

\begin{quote}
``I started off thinking about what I thought the rule could be based on the 3 numbers given at the beginning. Then I tried the different numbers to see if my theory was correct. When I put in numbers that justified what I thought the rule was it made me think that I was correct. But when I put in numbers that did not fit the rule that I was thinking of I tried to think of any other rule that would fit. When I couldn't think of any other rule I remembered that it said that it could give incorrect results for some of them and I figured that those ones must have been incorrect. So when I was completely certain of what I thought the rule was I stopped the trials and decided to make my guess.''
``I was pretty confident on what I thought the rule was so any trial that went against what I thought the rule was I didn't want to share because I did not want to confuse them.''

\end{quote}

\subsubsection{Participant 73;0765}
dfost1965@aol.com: A351P2O6QGTCS

Shared all trials. Started with a weird one (0,1,8;DNF) then (3,6,9;FIT) but seen as error.  Then an error (8,10,12;DNF) not identified.  Shared everything at the end also, making no error attributions, although there was one.  Some very H- triples. Final answer was nonsense.

\subsubsection{Participant 74;jaa246}
Shared all from the beginning.  All H+ trials.  A few errors which were not identified, and two descending trials (24,16,10;DNF) and (8,4,6;DNF).  Omitted the one correct error attribution at the end.

\subsubsection{Participant 75;0912781413}
Started with (12,24,36;FIT), so probably got confused. Proposed (12,24,36;DNF) twice, attributing it to error the first time but not the second, sharing neither trial.  Consecutive multiples of two.

\subsubsection{Participant 76;howdy}
Proposed just a few trials (8,10,12;FIT), with one (8,10,12;DNF) attributed to error correctly, and (6,4,2;DNF).  All shared.

\subsubsection{Participant 77;I returned the hit on accident}
Complicated.  Started with H- error (3,6,9;FIT) attributed to error but shared.  Ended up only sharing (2,4,6) trials. Weird final answer ``Enter 2,4,6 and then switch off entering 0 then 100 percent''.

\subsubsection{Participant 78;Js8021980}
Jennifer Smith: A1LKMVVMEOH1ZG 

Complicated again. Started with (5,7,9;DNF) saw it as error but shared it, then (1,3,4;DNF) shared, (4,6,8;DNF) shared but seen as error correctly.  Then on to more traditional triples.  Then shared all triples at the end of the task.

\subsubsection{Participant 79;917648531564}
All H+, all affirming, all shared.

\subsubsection{Participant 80;246}
Pretty normal set.  Began with (4,6,8;FIT).  Did not share (12,16,20;DNF) but not attributed to error.  Shared all at the end, attributed none to error, which was correct. 

\subsubsection{Participant 81;cal123}
All H+ trials except for at the boundary at the end.  Didn't share one trial (14,18,22;DNF) at the end, incorrectly attributing it to error, but this wasn't inconsistent with the final answer.  Seems like went with ascending evens. 

\subsubsection{Participant 82;la120390}
Started with (1,3,5;DNF) attributed to error and not shared, then (8,10,12;FIT) then (8,10,12;DNF) then (1,3,5;DNF) not shared.  Only shared one trial at the end (8,10,12;FIT).  Although omitted most trials, none were inconsistent with the final answer, except a correctly identified error.

\subsubsection{Participant 83;aassaai}
No trials.

\subsubsection{Participant 84;walterross}
All shared.  Only one H- trial at the end (10,11,12;DNF).  All shared at the end.

\subsubsection{Participant 85;1029381}
Started as usual (10,12,14;FIT).  Hit an actual error (4,6,8;DNF). Did one H tests in a row seeing both as error (11,13,15;DNF) and sharing neither.  Hit the boundary (100,103,105;DNF) and did not share and hit it again (100,102,104;DNF) again not sharing it.  Final answer was add two to previous number, but (11,13,15;DNF) was omitted, scamming.

\subsubsection{Participant 86;SUBAN2845}
Very short. Two trials of (2,4,6;FIT) and sharing then done.  Same at end of task.

\subsubsection{Participant 87; 9221}
All (2,4,6;FIT) trials.  All shared.  End of task omitted (8,10,11;DNF).

\subsubsection{Participant 88; 1bb650f402...}
Shared everything right from the start.  Had trouble identifying the upper bound.

\subsubsection{Participant 89; ramsey}
H+ triples, no false feedback.  Didn't complete end of task.

\subsubsection{Participant 90; 12456}
All shared from the beginning.  Only (2,4,6) proposed.  Even numbers.

\subsubsection{Participant 91; 905671}
All affirming triples except (98,100,102;DNF).  All shared.

\subsubsection{Participant 92; peteydog1}
All H+ trials, all affirming, all shared.

\subsubsection{Participant 93; scoobydoobydoo}
Affirming then came on some triples that didn't want to share.  Didn't share two trials at the end (8,12,20;DNF) and (10,20,190;DNF).  The final answer was add2.  Neither of the omitted trials was inconsistent.

\subsubsection{Participant 94; 1111}
All H+, all affirming all shared except (20,22,24;DNF) but it was shared.  All shared at the end.

\subsubsection{Participant 95; 1121}
Some disconfirming trials not shared.  Did not complete the final attributions.  None of the omitted trials were inconsistent with the final answer.

\subsubsection{Participant 96; wehaddababyitsaboy}
All shared, but a variety of H- trials done.  Three omitted at the end.  One was an error (8,10,12;DNF).  One was an error (-50,-48,-46;FIT) and one was not an error (1000,1002,1004;DNF).  This last one was inconsistent with the final answer $2n$ where $n>2$, making this scamming.

\subsubsection{Participant 97; 56537}
A variety of trials, omitting lots of them.  Omissions at the end of the task.  Final answer was multiples of 2 (Evens).  Omissions at the end of the task were not inconsistent with the final answer.

\subsubsection{Participant 98; Daisy43517}
Sandra Elliot: AY0BJCGY2DHIX 

All shared, with a mixture of H+ and a few H- trials at the beginning.  Did not complete final answer.

\subsubsection{Participant 99; allie23}
All H+ trials not sharing a few errors correctly identified.

\subsubsection{Participant 100; 159645}
Viji Lakshmi: A2U01ZSNRCSV3V

All H+ trials, mostly (2,4,6). All shared except one error.



First, participants may not have comprehended the perverse incentive.  However, open-ended comprehension checks before the task began indicated that participants fully grasped it.  For example, when asked how they and the participant with whom they were sharing trials earn money, a typical response was:
\begin{quote}
``The way that you earn bonus money is by making the other person think that my Final Answer matches the Actual Rule.''
``The other person earns bonus money based on how well they judge if my Final Answer matches the Actual Rule.''
\end{quote}

Many participants also discontinued their participation prematurely, explaining that the rule was too simple, not realizing that they had not identified it.  For example:

  \begin{quote}
``If the rule did not seem initially clear to me, I would have gone through more trials.  However, the number of trials I did and the results, given the instructions, seemed adequate to successfully complete the task.''%$
\end{quote}

  Those who quit prematurely, proposing only a few trials, also proposed only trials for which they expected affirmation, and received only affirmation, except for rare errors that they were highly accurate in identifying.  This confound limited participants' chances of obtaining disconfirming feedback.  Because disconfirming feedback is necessary for selective reporting, the confound causes the experiments to underestimate its magnitude.
